ABSTRACT

A study of multiaccess computer communications has characterized the distributions underlying an 
elementary model of the user-computer interactive process. The model used is elementary in the 
sense that many of the random variables that generally are of interest in computer communications 
studies can be decomposed into the elements of this model. Data were examined from four 
operational multiaccess systems, and the model is shown to be robust; that is, each of the variables 
of the model has the same distribution independent of which of the four systems is being examined. 
It is shown that the gamma distribution can be used to describe each of the continuous variables of 
the model, and that the geometric distribution can be used to describe the discrete variables. 
Approximations to the gamma distribution by the exponential distribution are discussed for the 
systems studied. 

Estimates of Distributions of Random Variables
for Certain Computer Communications Traffic Models

by

E. Fuchs and P. E. Jackson
Bell Telephone Laboratories, Incorporated
Holmdel, New Jersey

Introduction 

Since time sharing burst on the world some 6 or 7 years
ago, many analytical studies have been published of the behavior of
such systems. [1, 3, 7, 9, 13, 14, 15, 16, 20] In general, the completion
of an analytical study of a real process requires several steps to
be performed: construction of a process model, analysis of the
model, estimation of the model parameters, and verification of
the results. It is sad to report that in almost all of the published
studies, the last two steps are omitted.* It is evident
that the basic reasons for these omissions are (1) the difficulties
encountered in the collection of necessary data due to the
complexity of requisite simulations, the potential impairment
of the efficiency of real systems by the measurement process,
and the problem of avoiding violation of the proprietary constraints
of systems applications; (2) the costs in time and
dollars for conducting such studies; and (3) the questionable
utility of such data in light of the rapid evolution of system
capabilities and user characteristics. Nevertheless, as was first
pointed out by Sackman in 1967, inferences drawn from such
models for the design of systems without empirical determination
of parameter values and without testing of the model with the
estimated parameters rest on extremely shaky foundations.

* The pioneering work of Alan Scherr was of course supported
by extensive measurements on the M.I.T. Project MAC CTSS
System, and his results were verified by simulations. Other
investigations which were supported by measurements were
undertaken for the JOSS system at RAND Corp. [2, 25], the Q-32
Time Sharing System at S.D.C. [6, 17, 18, 26], and additional
investigations at Project MAC [10]. Each of these investigations
was performed for a specific problem for the system at hand, with
no attempt at generalization. However, the results of these
studies have been quoted in lieu of measurements by authors of
more general studies. An excellent summary and comparisons
of these investigations may be found in [23].

Clearly, the third reason is the most difficult to 
respond to. Many systems are changing so rapidly that a detailed 
characterization of any one will probably be outdated before it 
is completed. However, the architecture of computer communica- 
tion systems has matured to the point that the potential for 
insight gained from analysis of operational systems for testing 
models and for forming a basis for research aimed at improvement 
far outweighs the drawback of obsolescence. Indeed, this situation
calls for continued study and review. 

If analytical models are to be of value in the design 
of systems, then the first two problems can be resolved. Efforts 
have been underway for some time at Bell Telephone Laboratories 
to model the user-computer interaction process in on-line multi- 
access computer systems as an aid in the development of new 
computer communication systems and services. The studies include 
extensive efforts at the collection of data from representative
working systems to obtain realistic estimates of the parameters
of the models.* In a previous paper, [12] Jackson and Stubbs reported
some of the results of these efforts; specifically, a
data stream model of the interaction process was presented, to- 
gether with estimates of the average values of the basic random 
variables of the model as obtained from measurements on working 
systems. In this paper, we report additional results. First, 
we present the results of goodness of fit tests in which standard 
probability density functions are fitted to the empirical estimates 
of the distributions of the random variables of the model. Second, 
we examine the significance levels of the fits for the various
probability density functions and find that analytically tractable 
probability density functions can be used for the variables with 
reasonable significance levels. Third, we note a consistency 
between systems of widely varying types and applications in 
characterization of key variables and comment on the significance 
of this consistency. 

In a small way, these analyses are analogous to the
early studies of Erlang [8] and others 60-70 years ago,+ in which
representative examples of traffic data were collected for the
purpose of characterizing local and toll telephone system behavior.
The Poisson arrival rate process and exponential interarrival
time distribution were results of some of the earliest
of these studies. It is interesting to note that the validity
of these characterizations has been retained throughout the years
despite the many technological changes in telephone systems and
sociological changes in telephone usage.

* In every case these data are obtained on the premises of
the computer service provider and with his full permission
and cooperation. To ensure the privacy of the four systems
under discussion, however, they are not identified by name.

+ Molina [21] reports that G. T. Blood of the AT&T Co. in 1898
found a close agreement between the terms of a binomial
expansion and the results of the observations on the distribution
of busy telephone calls. This is the earliest
reference that we have been able to find to empirical
studies aimed at verification of assumptions employed in
telephone traffic modeling.

To provide a framework for presentation of the results, 
we first give an overview of the study methods and review the data 
stream model presented in Reference [12]. We then discuss the tech- 
niques employed to characterize the variables. Finally, we pre- 
sent the results of the study. 

Methods and Models

The modus operandi for this study is an in-depth analysis 
of selected multiaccess computer communication systems. These 
systems were selected on the basis that they are representative 
of the advanced state of the art, that the providers of the 
particular system are knowledgeable in communications, that 
the systems are fully operational with the initial break-in 
period accomplished, and that the computer service providers 
are willing to participate in the study. More detail on the 
selection procedure is given in Reference [12]. 

The data which are utilized in the results reported 
here are the detailed relationships of the flow of message 
characters to and from users and computers during on-line trans- 
actions. The model describes the communications process in terms 
of random variables which give intercharacter times and the
sizes of clusters of characters as they are transmitted through
the communication interface, so the raw data could be collected
at the computer ports of active multiaccess computer systems.
The model did not require, nor did we collect, data from internal
computer processes such as the length of various internal queues.*

Figure 1 illustrates the data stream model. A "call" 
(or a connect-disconnect time period) is represented as the 
summation of a sequence of time periods during which the user 
sends characters without receiving, interleaved with time per- 
iods during which he receives characters without sending. (This 
implies half-duplex operation. Simple modifications to the model 
would allow the accommodation of full-duplex operation.) The 
periods during which the user is sending characters to the com- 
puter are defined as user burst segments. The periods during 
which he is receiving characters sent from the computer are 
computer burst segments. A user burst segment, by definition, begins
at the end of the last character of the previous computer burst
segment. Similarly, a computer burst segment begins at the end
of the last character sent by the user. The first burst segment
of a call begins when the call is established, and the last burst
segment ends when the call is terminated as measured at the computer
interface.

* It is apparent that a model which portrays the interplay
of the internal computer processes, such as memory management
and processor time scheduling algorithms, with the
communication processes would be more satisfactory for
joint optimization of computer and communication performance.
However, acquisition of data describing the
former processes was not within the scope of this study.

Within a given burst segment, there are periods of line 
activity and of line inactivity. The first inactive period of 
a user burst segment is defined as think time. That is, think 
time is the time that elapses from the end of the last previous 
computer character until the beginning of the first user character 
in that burst segment. In most cases, think time is employed 
by the user to finish reading the previous computer output and 
to think about what to do next. The corresponding inactive per- 
iod in a computer burst segment is called idle time. In some 
systems idle time represents time during which the user waits 
for the return of "line feed", after sending "carriage return"; 
in other systems, idle time represents the time during which 
the user's program is being processed or is in queue. The
remaining inactive periods within a burst segment are called
intercharacter times and interburst times. A prerequisite for their
definition is the definition of a burst. 

Two consecutive characters are defined as belonging to 
the same burst if the period of inactivity between the characters 
is less than one-half character width. Thus, each burst is the 
longest string of consecutive characters where the period of 
inactivity between any two consecutive characters is less than 
one-half character width. All of the characters in a burst must, 
of course, be transmitted from the same party (user or computer).
For example, every character of an unbroken string of characters
sent at line rate is in the same burst.

For characters within the same user burst, an inactive 
time between two consecutive characters is called a user inter- 
character time. The corresponding variable for computer bursts 
is computer intercharacter time. For bursts within the same 
user (computer) burst segment, the inactive time between two 
consecutive bursts is called a user (computer) interburst time. 
Five final variables of the data stream model are: number of 
user bursts per burst segment, number of computer bursts per 
burst segment, number of characters per user burst, number of 
characters per computer burst, and temporal character width 
(time from start to end of one character). 
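
The burst definition above can be illustrated with a short Python sketch. It is not part of the original study: the names `Burst`, `split_into_bursts` and `interburst_times`, the trace format (character start time, sending party), and the sample numbers are all assumptions introduced for illustration. The sketch groups a time-ordered character trace into bursts using the one-half character width rule.

```python
from dataclasses import dataclass


@dataclass
class Burst:
    sender: str    # "user" or "computer"
    start: float   # start time of the first character, in seconds
    end: float     # end time of the last character, in seconds
    n_chars: int   # number of characters in the burst


def split_into_bursts(chars, char_width):
    """Group (start_time, sender) pairs, sorted by time, into bursts.

    Two consecutive characters share a burst when they come from the same
    party and the inactive gap between them is under half a character width.
    """
    bursts = []
    for start, sender in chars:
        end = start + char_width
        if bursts:
            last = bursts[-1]
            gap = start - last.end
            if sender == last.sender and gap < 0.5 * char_width:
                last.end = end
                last.n_chars += 1
                continue
        bursts.append(Burst(sender, start, end, 1))
    return bursts


def interburst_times(bursts):
    """Inactive time between consecutive bursts sent by the same party."""
    return [
        nxt.start - prev.end
        for prev, nxt in zip(bursts, bursts[1:])
        if prev.sender == nxt.sender
    ]


# Hypothetical trace at 10 characters/second (character width 0.1 s).
trace = [(0.0, "user"), (0.1, "user"), (0.2, "user"), (0.5, "user"),
         (1.0, "computer"), (1.1, "computer")]
bursts = split_into_bursts(trace, char_width=0.1)
print([(b.sender, b.n_chars) for b in bursts])
print(interburst_times(bursts))
```

In this hypothetical trace the user sends two bursts and the computer one. The gap between the last user burst and the first computer character is not an interburst time; in the model's terms it is the idle time of the computer burst segment.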

Collected Data and Analysis

During the study, data have been collected for a large 
number of transactions for each of several multiaccess computer 
systems. Data from four of the systems are discussed in this 
paper. These systems are labeled A, B, C and D. Systems A and 
B have the same computer equipment and basically the same mix 
of computer applications (scientific/engineering programming and 
problem solving), although the average loads supported by the
two systems during the study periods were quite different. Sys- 
tem C has computer equipment different from each of the others 
and its mix of user applications is oriented toward business 
problem solving. System D also has computer equipment different 
from each of the others, and its applications are data collection
and data dissemination in an inquiry/response method of operation.
All four systems serve low-speed, half-duplex, teletypewriter-like
terminals. Table I summarizes the salient characteristics
of these systems.

TABLE I

                                        Systems
                        A            B            C            D

Computer Type           Brand X      Brand X      Brand Y      Brand Z

Transmission Speed      10           10           15           10
(Characters/Second)

Primary Application     Scientific   Scientific   Business     Inquiry/
                                                               Response

Load                    Moderate     Heavy        Moderate     Light/
                                                               Moderate

The random variables of Figure 1 are of two types. Some 
are discrete, such as the number of characters per burst. Others 
are continuous, such as think time. Modeling techniques most 
commonly used in computer communications studies include queueing 
processes, renewal processes, birth-death models, Markov processes, 
and, to a limited extent, flow models. Most key random parameters
of models used in computer-communication studies are either inter-event
times, such as times between arrivals at a server or burst
lengths, or counts, such as the number of arrivals in a batch arrival
process. In solving these types of models, only a very few random
functions are tractable, and in some cases allowable. In the
category of desirable functional forms fall the Poisson,
geometric and binomial distributions for discrete processes, and
the gamma distribution family for continuous processes. Hence,
we are extremely interested in the extent to which the key parameters
of such models can be described by these few desirable
distributions.

* The term load, as used in Table I, denotes the relative occupancy
of the processor due to on-line demands and background batch work
(if any); nothing is implied directly as to the load on the
communication channel.

Data collected from the communication lines at the 
computer ports of the four operating systems described above 
were used to seek desirable distributions to describe each of 
the random variables of the data stream model. These data were 
laundered to remove ambiguities and were then partitioned into 
sets describing each of the variables. For each set of data 
for each variable for each system, goodness-of-fit tests were 
performed to ascertain which standard probability functions 
could be used to describe the variables. 

The set of distributions used for goodness-of-fit
tests included the normal, Cauchy, Laplace, chi-square, exponential,
hyperexponential, gamma, and lognormal distributions for continuous
variables and the geometric (with and without mass at the origin),
uniform, Poisson, compound Poisson and binomial distributions
for discrete variables. For each variable, a compound goodness-of-fit
test* was performed where the parameters of the hypothesized
distributions (those being tested) were adjusted so that
the mean and variance for a two-parameter distribution were the
same as the sample mean and sample variance. For a single-parameter
distribution the mean of the distribution was equated
to the sample mean.

* As the existing tests for goodness-of-fit were not satisfactory
for our purposes because of their low power or excessive
computation time, a new test was devised. This test is briefly
outlined in the Appendix.
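
The parameter adjustment described above amounts to matching moments. The following Python sketch is an illustration only and is not taken from the study. The function names are hypothetical, and the two geometric parameterizations (support starting at 1, and support starting at 0 for "mass at the origin") are assumptions about what the authors intended.

```python
import statistics


def fit_gamma_by_moments(sample):
    """Return (shape, scale) of the gamma distribution whose mean and
    variance equal the sample mean and sample variance."""
    mean = statistics.mean(sample)
    var = statistics.variance(sample)   # unbiased sample variance
    shape = mean * mean / var           # equals 1 / V**2, V = coeff. of variation
    scale = var / mean
    return shape, scale


def fit_geometric_by_mean(sample, mass_at_origin=False):
    """Return p for a geometric distribution matched to the sample mean.

    Assumed parameterizations: support 1, 2, ... has mean 1/p;
    support 0, 1, ... (mass at the origin) has mean (1 - p)/p.
    """
    mean = statistics.mean(sample)
    return 1.0 / (1.0 + mean) if mass_at_origin else 1.0 / mean


# Hypothetical think times (seconds) and burst counts.
shape, scale = fit_gamma_by_moments([1.0, 2.0, 3.0, 4.0])
p = fit_geometric_by_mean([1, 2, 3])
print(shape, scale, p)
```

Because the gamma shape parameter equals 1/V², a sample with coefficient of variation V = 1 is matched by shape 1, which is the exponential distribution.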

Results of the goodness-of-fit tests are shown in
Table II. From the table, we see that the geometric distribution
can be used to describe every discrete process but one (the
single exception is an impulse function which is a degenerate
form of the geometric distribution). Similarly, each of the
continuous random variables of the model can be described by
the gamma distribution, and the think times, idle times and
interburst times can be described additionally by the lognormal
distribution. These results are significant for two reasons.

TABLE II

RESULTS OF GOODNESS-OF-FIT TESTS
ACCEPTABLE* DISTRIBUTIONS+

Random Variables

  No. of Burst Segments per Call
  Think Time
  User Interburst Time
  Computer Interburst Time
  No. Bursts/User Burst Segment
  No. Bursts/Computer Burst Segment
  No. Characters/User Burst
  No. Characters/Computer Burst
  User Intercharacter Time
  Computer Intercharacter Time

Entries by System (in the order recovered)

  A:  G;  L,Γ;  L,Γ;  L,Γ;  G;  Γ
  B:  G;  L,Γ;  L,Γ;  L,Γ;  G;  Γ
  C:  G;  L,Γ;  L,Γ;  L,Γ;  G;  N/A
  D:  G,CP;  L,Γ;  L,Γ;  L,Γ;  G,CP;  G;  Γ

* Acceptable at the five percent level.

+ Γ - gamma distribution, L - lognormal distribution, G - geometric
distribution, CP - compound Poisson distribution, I - constant at
X = 1.

First, the data stream model, which is elementary in
the sense that many of the variables that generally are of
interest in computer communication studies can be decomposed
into the elements of the model, is shown to be robust; that is,
each of the variables of the model is described by the same
distribution independent of the computer system being examined.*
These results were obtained in spite of the fact that three
different computer types and operating systems were investigated.
In addition, the computer loads and programming applications were
different. Thus, in modeling data communication systems, we can
apply the analytical results from a long-holding-time system to
other long-holding-time systems merely by changing the parameters
of the distributions.

* Although the truth of the statement for the "number of characters
per user burst" is artificial, it is made because even for that
case the same distributional form can be used in practice with no
operational difficulty by choosing appropriate parameters for the
distribution.

Jackson and Stubbs [12] have examined the
mean values of the model variables for the first three systems
and make the following observations:

1. Delays introduced by the computer (primarily idle time 
and computer interburst delay) can be a large component 
of holding time and are affected by the number of simul- 
taneous users on the system, probably by the computer 
scheduling algorithm, and by the characteristics of the 
communications control unit. 

2. The average number of characters sent by the computer 
to the user is an order of magnitude greater than the 
number of characters sent by the user to the computer. 

3. Delays introduced by the user are a significant contri- 
butor to average holding time and are remarkably close 
in absolute values for the four systems studied. 

These three observations are examples of information 
that may be employed by system designers in investigations into 
improved communications for multiaccess computers. In modeling 
(probabilistically) the behavior of present and proposed systems 
to determine their sensitivity to particular elements of the data 
stream model, the parameters of the distributions need only be
changed and not the distributions themselves. These data are
equally valuable for investigations into computer operating systems.
For example, one might investigate changes in computer scheduling
algorithms as reflected in changes in idle time and interburst
delay parameters, changes in transmission speed from computer to
user and the converse, and changes in terminal characteristics which
may influence (hopefully reduce) user delays. Indeed, recently
there have been reported many investigations into the performance
of scheduling algorithms as measured by response time. [1, 3, 7, 9, 13,
14, 15, 16, 20] Almost without exception, these investigations
hypothesize arrival rates of requests for CPU time without the
support of measurements. Since such arrivals can be approximated
from the variables of the data stream model, the above observations
as to the efficacy of the results reported in this paper are
demonstrated.

Second, the particular distributions obtained in Table
II are tractable and are useful in further analytical studies. For 
example, the geometric distribution was obtained for the discrete 
distributions and the gamma family for the continuous distributions. 

Table III shows the coefficients of variation, V, for
the continuous variables for the four computer systems investigated.*
Since the exponential distribution belongs to the gamma distribution
family and is the special case where V=1, for certain applications
it may be possible to use the exponential distribution to
describe the arrival and delay processes. To illustrate the
similarity between the exponential distribution and the gamma
distribution with 1.0 < V < 1.8, Table IV is included.

* For one system, the user terminal had an automatic response
at the end of a computer burst segment rather than a true
user "think time" response. For this system, the estimated
value of V for the think time distribution was 0.72, close to
that for the hyperexponential distribution (V=0.71).

TABLE III

COEFFICIENT OF VARIATION FOR GAMMA DISTRIBUTIONS

                                         Systems
                                  A       B       C       D

Think Time                       1.56    1.64    0.72    1.61
Idle Time                        1.09    1.54    1.59    1.45
User Interburst Time             1.39    1.59    1.49    1.54
Computer Interburst Time         1.56    1.61    1.59    1.64
User Intercharacter Time         1.67    1.54    1.67    1.59
Computer Intercharacter Time     1.67    1.67    1.59    1.56

TABLE IV

DIFFERENCE BETWEEN CUMULATIVE DISTRIBUTIONS OF GAMMA
AND EXPONENTIAL VARIATES

                  Error in Percent* for Independent Variable at

Gamma             One-Half                  Twice        Maximum
Coefficient       Mean         Mean         Mean         Error
of Variation      Value        Value        Value        in Tail

1.0                0.0          0.0          0.0          0.0
1.2               14.5          3.6          2.0          2.4
1.4               23.1          6.22         3.8          5.0
1.6               28.2          7.3          6.4          8.6
1.8               30.6          6.7         10.2         13.7

* With respect to gamma distribution.

The error listed in each column is the difference
between the cumulative distributions at the point given. The 
last column lists the largest value of this error in the upper 
tail region. From the table, we can see that the approximation 
becomes less accurate as V becomes larger and is less accurate 
for smaller values of the independent variable than for larger 
ones. We further observe that even close to the origin, the class 
of gamma distributions defined by V>1 has the same general shape 
as the exponential distribution. For much analytical work, the 
behavior of the distribution function in the neighborhood of the 
mean and in the upper tail are of the most interest. For these 
types of problems, if errors of the magnitudes shown in Table IV
are allowable (or, alternatively, if the relative coefficients of
variation shown in Table III are tolerable; note that the coefficients
of variation of Table III may be interpreted as relative
to the exponential distribution, which has a coefficient of variation
of unity), then the exponential distribution may be used in
place of the gamma distribution.
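
The kind of comparison tabulated in Table IV can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' computation: the error is assumed to be |F_gamma - F_exp| / F_gamma in percent, both distributions are given the same mean (taken as 1), and the function names are hypothetical. Under these assumptions the value at one-half the mean for V = 1.2 comes out close to the tabulated 14.5; agreement with the other entries is not guaranteed, since the exact definition used for the table is not stated.

```python
import math


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series for the regularized lower
    incomplete gamma function P(shape, x / scale)."""
    if x <= 0.0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    # Series: sum over n of z**n / (shape * (shape + 1) * ... * (shape + n)).
    while term > 1e-16 * total:
        term *= z / (shape + n)
        total += term
        n += 1
    return total * math.exp(shape * math.log(z) - z - math.lgamma(shape))


def percent_error(v, x_over_mean):
    """Relative difference, in percent of the gamma CDF, between a gamma
    distribution with coefficient of variation v and an exponential
    distribution of the same mean, evaluated at x_over_mean."""
    shape = 1.0 / (v * v)
    scale = v * v              # shape * scale = 1, so the mean is 1
    f_gamma = gamma_cdf(x_over_mean, shape, scale)
    f_exp = 1.0 - math.exp(-x_over_mean)
    return 100.0 * abs(f_gamma - f_exp) / f_gamma


for v in (1.0, 1.2, 1.4, 1.6, 1.8):
    print(v, [round(percent_error(v, x), 1) for x in (0.5, 1.0, 2.0)])
```

The power series is used here only to keep the sketch free of external dependencies; a library routine for the regularized incomplete gamma function would serve equally well.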

Thus, assuming independent interarrivals, one can use 
the Poisson process to describe any of the arrival processes and 
have the large body of queueing theory at one's disposal to analyze 
the communication process of time-shared computer systems. Even 
for those distributions where the exponential interarrival approx- 
imation is not useful, the gamma distribution is tractable for 
some types of analyses. 

Conclusions

In analyzing computer communication systems for time- 
sharing applications, the results of this work have shown that 
a variety of techniques can be applied to model the processes. 
Since the input traffic process has been characterized in terms 
that are usually tractable for analytical models, realistic re- 
sults may be obtained using standard analytical techniques. In 
some models where the estimated distributional forms are not 
amenable to analysis, appropriate approximation techniques are 
available.

This work has shown that the communication process 
between a multiaccess computer and a user at a teletypewriter- 
like terminal can be represented by an elementary model from which 
more complex models may be constructed. Further, by using real 
data from operational multiaccess systems, we have shown that 
the model is robust and that the distributions obtained for each 
of the variables are tractable. In certain cases, the character 
arrival process can be approximated by a Poisson process. Thus, 
in modeling the communication process of long-holding-time multi- 
access computer systems only the parameters of the distributions 
for the variables change for various computer types, applications 
and system loading. 

These observations can be combined with the observations
of Jackson and Stubbs [12] on computer-introduced delays, user-introduced
delays and the relative amounts of information flow
in each direction on the communication line to give a comprehensive
picture of the communication process. For example, these analyses
support analytical and simulation studies at Bell Telephone
Laboratories which seek solutions to computer access data communications
problems, cf. Chu's, Meltzer's and Pile's studies [4, 5, 19, 22]
on asynchronous time division multiplexing.

Studies of multiaccess computer communications are 
continuing. Data are being collected from systems with different 
terminal types, system configurations, average holding times, and 
user applications. Analyses of data for these new systems will 
expand our understanding of the computer-communication processes 
involved as we have a broader base from which to draw conclusions and 
make comparisons. 

APPENDIX

A Fourier Series Test of Goodness of Fit

The examination of the data for goodness of fit 
posed considerable problems. The objective of this part of 
the study was to determine the suitability of analytically 
tractable probability density functions (p.d.f.'s) to 
describe the significant random processes. 

The classical goodness-of-fit tests which were
first applied in this study were the chi-square test, the
Kolmogorov-Smirnov test, and the Cramer-Von Mises test.
These tests suffer from the following deficiencies. The
power of the chi-square test is very poor, and the realized
significance level is sensitive to the number and placement
of class intervals. The Kolmogorov-Smirnov test and the
Cramer-Von Mises test require ordering the data, which
requires considerable computer time even with the most
efficient algorithms for the quantities of data involved
in this study.* Further, since we are interested in the
suitability at an acceptable significance level of
analytically tractable p.d.f.'s, the tradeoff between
significance level, power, and number of sample points is
of concern to us. In this regard, the only available
expression for the power of the Kolmogorov-Smirnov test is a
lower bound, and no expression for the power of the Cramer-Von
Mises test is known at this time. An additional complication
is that we need to perform a compound goodness-of-fit test where
we estimate the parameters of the distribution from the data as
well as testing a given distribution.

* The advantage of the test used (described in the following
paragraphs) over the Kolmogorov-Smirnov test or the Cramer-Von
Mises test, in units of computer time, is roughly
25-50 to one.

These difficulties led to the development of a new test 
by Jackson, called the Fourier Series Test of Goodness-of-Fit,
which has the following advantages: 

1. No decisions about number and spacing of class 
intervals are required as with the construction 
of histograms required for the chi-square test; 

2. The set of observations need not be ordered; 

3. Estimators are easily updated for additions to the 
data base by recursive relationships which require 
only minimal operations on the new data; 

4. The power of the test is comparable to the power 
of the best of the classical tests for reasonable 
alternatives; 

5. The computation time for the Fourier test is less 
than the computation time for the classical tests; 
and 
6. Using the limiting distributions, the power of the
Fourier test can be computed analytically for
both the simple and compound hypothesis tests,
while it cannot be computed for most of the
classical tests.

Briefly, the technique proceeds as follows:
The probability density function is estimated from the data by 
a finite linear combination of sine and cosine functions harmonic 
over the region of support of the function - a truncated Fourier 
series - where the coefficients of the series are estimated from 
the data, and the number of terms of the series is determined
by a minimization technique. Then, for each prespecified standard 
distribution, the hypothesis that the estimated Fourier series 
function is not significantly different from the Fourier series 
expansion of the p.d.f. of the standard distribution is tested. 
The test statistic used is a function of the squared differences 
between the coefficients of the Fourier series expansion of the 
estimated distribution and the coefficients of a Fourier series 
expansion of the hypothesized standard distribution. 
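
The outline above can be made concrete with a rough Python sketch. It is not the test devised by Jackson, whose details are not given here. The sketch assumes a fixed number of terms (the paper selects it by a minimization technique), uses an unweighted sum of squared coefficient differences as the statistic, and stops short of any significance calculation. All function names, the support [0, 8], and the simulated data are hypothetical.

```python
import math
import random


def sample_coefficients(xs, support, n_terms):
    """Estimate Fourier coefficients (a_k, b_k), k = 1..n_terms, of the
    density on [0, support] as sample means of cosines and sines."""
    n = len(xs)
    coeffs = []
    for k in range(1, n_terms + 1):
        w = 2.0 * math.pi * k / support
        a_k = 2.0 / support * sum(math.cos(w * x) for x in xs) / n
        b_k = 2.0 / support * sum(math.sin(w * x) for x in xs) / n
        coeffs.append((a_k, b_k))
    return coeffs


def pdf_coefficients(pdf, support, n_terms, grid=4000):
    """Fourier coefficients of a hypothesized p.d.f. on [0, support],
    computed by midpoint-rule integration."""
    h = support / grid
    points = [(i + 0.5) * h for i in range(grid)]
    values = [pdf(x) for x in points]
    coeffs = []
    for k in range(1, n_terms + 1):
        w = 2.0 * math.pi * k / support
        a_k = 2.0 / support * h * sum(v * math.cos(w * x)
                                      for x, v in zip(points, values))
        b_k = 2.0 / support * h * sum(v * math.sin(w * x)
                                      for x, v in zip(points, values))
        coeffs.append((a_k, b_k))
    return coeffs


def squared_difference(est, hyp):
    """Unweighted sum of squared coefficient differences (an assumption)."""
    return sum((a - a0) ** 2 + (b - b0) ** 2
               for (a, b), (a0, b0) in zip(est, hyp))


# Simulated data: exponential variates with mean 1, restricted to [0, 8).
random.seed(1)
T, K = 8.0, 8
data = [x for x in (random.expovariate(1.0) for _ in range(5000)) if x < T]
est = sample_coefficients(data, T, K)
stat_exp = squared_difference(est, pdf_coefficients(lambda x: math.exp(-x), T, K))
stat_uniform = squared_difference(est, pdf_coefficients(lambda x: 1.0 / T, T, K))
print(stat_exp, stat_uniform)
```

Note that the coefficient estimates are plain sums over the observations, so the data need not be ordered and new observations can be folded in incrementally, in line with advantages 2 and 3 listed above.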
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Abstract 

We study an approach to text categorization that combines distributional clustering of words and a 
Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the 
recently introduced Information Bottleneck method, which generates a compact and efficient 
representation of documents. When combined with the classification power of the SVM, this method 
yields high performance in text categorization. This novel combination of SVM with word-cluster 
representation is compared with SVM-based categorization using the simpler bag-of- words (BOW) 
representation. The comparison is performed over three known datasets. On one of these datasets 
(the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based 
representation in terms of categorization accuracy or representation efficiency. On the two other sets 
(Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster 
representation. We investigate the potential reasons for this behavior and relate it to structural 
differences between the datasets. 



1. Introduction 

The most popular approach to text categorization has so far been relying on a simple document 
representation in a word-based "input space". Despite considerable attempts to introduce more 
sophisticated techniques for document representation, like ones that are based on higher order word 
statistics (Caropreso et al., 2001), NLP (Jacobs, 1992; Basili et al., 2000), "string kernels" (Lodhi 
et al., 2002) and even representations based on word clusters (Baker and McCallum, 1998), the 
simple minded independent word-based representation, known as Bag-Of-Words (BOW), remained 
very popular. Indeed, to date the best categorization results for the well-known Reuters-21578 and 
20 Newsgroups datasets are based on the BOW representation (Dumais et al., 1998; Weiss et al., 
1999; Joachims, 1997). 

©2003 Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, and Yoad Winter. 



Bekkerman, El-Yaniv, Tishby, and Winter 



In this paper we empirically study a familiar representation technique that is based on word 
clusters. Our experiments indicate that text categorization based on this representation can 
outperform categorization based on the BOW representation, although the performance that this method 
achieves may depend on the chosen dataset. These empirical conclusions about the categorization 
performance [...] 
recently introduced Information Bottleneck (IB) clustering framework (Tishby et al., 1999; Slonim 
and Tishby, 2000, 2001) for generating document representation in a word cluster space (instead of 
word space), where each cluster is a distribution over document classes. We show that the combination 
of [...] 
1992; Schölkopf and Smola, 2002) allows for high performance in categorizing three benchmark 
datasets: 20 Newsgroups (20NG), Reuters-21578 and WebKB. In particular, our categorization of 
20NG outperforms the strong algorithmic word-based setup of Dumais et al. (1998) (in terms of 
categorization [...] 
results for the 10 largest categories of the Reuters dataset. 

This representation using word clusters, where words are viewed as distributions over documents 
[...] "distributional 
clustering" idea of Pereira et al. (1993). This technique enjoys a number of intuitively appealing 
properties and advantages over other feature selection (or generation) techniques. First, the 
dimensionality reduction computed by this word clustering implicitly considers correlations between 
the various features (terms or words). In contrast, popular "filter-based" greedy approaches for 
feature selection such as Mutual Information, Information Gain and TFIDF (see, e.g., Yang and 
Pedersen, 1997) only consider each feature individually. Second, the clustering that is achieved 
by the IB method provides a good solution to the statistical sparseness problem that is prominent 
in the straightforward word-based (and even more so in n-gram-based) document representations. 
Third, the clustering of words generates extremely compact representations (with minor information 
compromise [...] 
advantages, the IB word clustering technique is formally motivated by the Information Bottleneck 
principle, in which the computation of word clusters aims to optimize a principled target function 
(see Section 3 for further details). 

Despite these conceptual advantages of this word cluster representation and its success in 
categorizing the 20NG dataset, we show that it does not improve accuracy over BOW-based categorization 
when it is used to categorize the Reuters dataset (ModApte split) and a subset of the WebKB 
dataset. We analyze this phenomenon and observe that the categories of documents in Reuters and 
WebKB are less "complex" than the categories of 20NG in the sense that documents can almost be 
"optimally" categorized using a small number of keywords. This is not the case for 20NG, where 
[...] 

The rest of this paper is organized as follows. In Section 2 we discuss the most relevant related 
work. Section 3 presents the algorithmic components and the theoretical foundation of our scheme. 
Section 4 describes the datasets we use and their textual preprocessing in our experiments. Section 5 
presents our experimental setup and Section 6 gives a detailed description of the results. Section 7 
discusses these results. Section 8 details the computational efforts in these experiments. Finally, in 
Section 9 we conclude and outline some open questions. 
2. Related Results 

In this section [...] 

limit the discussion to relevant feature selection and generation techniques, and best known cate- 
gorization results over the corpora we consider (Reuters-21578, the 20 Newsgroups and WebKB). 
For more comprehensive surveys on text categorization the reader is referred to Sebastiani (2002); 
Singer and Lewis (2000) and references therein. Throughout the discussion we assume familiarity 
with standard terms used in text categorization. 1 

We start with a discussion of feature selection and generation techniques. Dumais et al. (1998) 
report on experiments with multi-labeled categorization of the Reuters dataset. Over a BOW binary 
representation (where each word receives a count of 1 if it occurs once or more in a document and 
0 otherwise) [...] let C 
denote the set of document categories and let $X_c \in \{0, 1\}$ be a binary random variable denoting the 
event that a random document belongs (or not) to category $c \in C$. Similarly, let $X_w \in \{0, 1\}$ be a 
random variable denoting the event that the word w occurred in a random document. The Mutual 
Information between $X_c$ and $X_w$ is 

    \[ I(X_c, X_w) = \sum_{X_c, X_w \in \{0,1\}} P(X_c, X_w) \log \frac{P(X_c, X_w)}{P(X_c) P(X_w)}. \]    (1) 

Note that when evaluating $I(X_c, X_w)$ from a sample of documents, we compute $P(X_c, X_w)$, $P(X_c)$ and 
$P(X_w)$ using their empirical estimates. 2 For each category c, all the words are sorted according to 
decreasing value of $I(X_c, X_w)$ and the k top scored words are kept, where k is a pre-specified or 
data-dependent parameter. Thus, for each category there is a specialized representation of documents 
projected to the most discriminative words for the category. 3 In the sequel we refer to this Mutual 
Information feature selection technique as "MI feature selection" or simply as "MI". 
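
To make the selection rule concrete, here is a minimal Python sketch of Mutual Information scoring for one category, followed by keeping the k top scored words. It is an illustration under assumptions, not the authors' code: the probabilities are estimated from document counts (the footnote in the text estimates them from word occurrence counts instead), and the function names and the word_stats layout are hypothetical.

```python
import math


def mutual_information(n_both, n_cat, n_word, n_total):
    """I(X_c, X_w) in nats, estimated from a 2x2 table of document counts.

    n_both : documents in category c that contain word w
    n_cat  : documents in category c
    n_word : documents that contain word w
    n_total: all documents
    """
    cells = {
        (1, 1): n_both,
        (1, 0): n_cat - n_both,
        (0, 1): n_word - n_both,
        (0, 0): n_total - n_cat - n_word + n_both,
    }
    mi = 0.0
    for (x_c, x_w), count in cells.items():
        if count == 0:
            continue  # the term 0 * log(0) is taken as 0
        p_joint = count / n_total
        p_c = (n_cat if x_c else n_total - n_cat) / n_total
        p_w = (n_word if x_w else n_total - n_word) / n_total
        mi += p_joint * math.log(p_joint / (p_c * p_w))
    return mi


def top_k_words(word_stats, n_cat, n_total, k):
    """Return the k words with the highest MI score for one category.

    word_stats maps each word to (n_both, n_word); this layout is an
    assumption of the example.
    """
    def score(word):
        n_both, n_word = word_stats[word]
        return mutual_information(n_both, n_cat, n_word, n_total)

    return sorted(word_stats, key=score, reverse=True)[:k]
```

A word that is independent of the category scores 0, while a word that occurs in exactly the documents of a category covering half of the collection scores log 2.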

MI feature selection method yields a 92.0% break-even point (BEP) on the 10 largest categories in 
the Reuters dataset. 4 As far as we know this is the best multi-labeled categorization result of the (10 
largest categories [...] 
with MI feature selection as a baseline for handling BOW-based categorization. Some other recent 
[...] 
Among these works it is worth mentioning the empirical study by Yang and Liu (1999) (who showed 
that SVM [...] 
and small training sets) and the theoretical account of Joachims (2001) for the suitability of SVM 
for text categorization. 

1. Specifically, we refer to precision/recall-based performance measures such as break-even point (BEP) and F-measure 
and to uni-labeled and multi-labeled categorization. See Section 5.1 for further details. 

2. Consider, for instance, $X_c = 1$ and $X_w = 1$. Then $P(X_c, X_w) = N_w(c)/N$, $P(X_c) = N(c)/N$ and $P(X_w) = N_w/N$, where $N_w(c)$ is the 
number of occurrences of word w in category c, $N(c)$ is the total number of words in c, $N_w$ is the number of occurrences 
of word w in all the categories, and N is the total number of words. 

3. Note that throughout the paper we consider categorization schemes that decompose m-category categorization 
problems into m binary problems in a standard "one-against-all" fashion. Other decompositions based on error correcting 
codes are also possible; see (Allwein et al., 2000) for further details. 

4. It is also shown in (Dumais et al., 1998) that SVM is superior to other inducers (Rocchio, decision trees, Naive Bayes 
and Bayesian Nets). 




Baker and McCallum (1998) apply the distributional clustering scheme of Pereira et al. (1993) 
(see Section 3) for clustering words represented as distributions over categories of the documents 
where they appear. Given a set of categories $C = \{c_i\}_{i=1}^m$, a distribution of a word w over the 
categories is $\{P(c_i|w)\}_{i=1}^m$. Then the words (represented as distributions) are clustered using an 
agglomerative clustering algorithm. Using a naive Bayes classifier (operated on these conditional 
distributions) the authors tested this method for uni-labeled categorization of the 20NG dataset and 
reported an 85.7% accuracy. They also compare this word cluster representation to other feature 
selection and generation techniques such as Latent Semantic Indexing (see, e.g., Deerwester et al., 
1990), the above Mutual Information index and the Markov "blankets" feature selection technique 
of Koller and Sahami (1996). The authors conclude that categorization that is based on word 
clusters is slightly less [...] 
representation. 

The "distributional clustering" approach of Pereira et al. (1993) is a special case of the general 
Information Bottleneck (IB) clustering framework presented by Tishby et al. (1999); see Section 3.1 
for further details. Slonim and Tishby (2001) further study the power of this distributional word 
clusters representation and motivate it within the more general IB framework (Slonim and Tishby, 
2000). They show that categorization based on this representation can improve the accuracy over 
the BOW representation whenever the training set is small (about 10 documents per category). 
Specifically [...] 
observe 18.4% improvement in accuracy over a BOW-based categorization. 

Joachims [...] 
feature selection, and achieved a break-even point of 86.4%. Joachims (1997) also investigates 
uni-labeled categorization [...] 
TFIDF-weighted (see, e.g., Manning and Schütze, 1999) BOW representation that is reduced using 
the Mutual Information index. He obtains 90.3% accuracy, which to date is, to our knowledge, the 
best published accuracy of a uni-labeled categorization of the 20NG dataset. Joachims (1999) also 
experiments with SVM categorization of the WebKB dataset (see details of these results in the last 
row in Table 1). 

Schapire and Singer (1998) consider text categorization using a variant of AdaBoost (Freund and 
Schapire, 1996) applied with one-level decision trees (also known as decision stumps) as the base 
classifiers. The resulting algorithm [...] 
of Reuters (ModApte split). Weiss et al. (1999) also employ boosting (using decision trees as the 
base classifiers and an adaptive resampling scheme). They categorize Reuters (ModApte split) 
with 87.8% BEP using the largest 95 categories (each having at least 2 training examples). To our 
knowledge this is the best result that has been achieved on (almost) the entire Reuters dataset. 

Table 1 summarizes the results that were discussed in this section. 
3. Methods and Algorithms 

The text categorization scheme that we study is based on two components: (i) a representation 
scheme of documents as "distributional clusters" of words, and (ii) an SVM inducer. In this sec- 
tion we describe both components. Since SVMs are rather familiar and thoroughly covered in the 
literature, our main focus in this section is on the Information Bottleneck method and distributional 
clustering. 
Authors | Dataset | Feature Selection or Generation | Classifier | Main Result | Comments 
--------|---------|---------------------------------|------------|-------------|--------- 
Dumais et al. (1998) | Reuters | MI and other selection methods | SVM, Rocchio, [...] Naive Bayes, Bayesian nets | SVM + MI is [...] on 10 largest categories | Our baseline for Reuters (10 largest categories) 
Joachims [...] | [...] | [...] | SVM | 86.4% BEP | 
Schapire and Singer | Reuters | none | Boosting [...] | 86% BEP | 
Weiss et al. (1999) | Reuters | none | Boosting of decision trees | 87.8% BEP | Best on 95 categories of Reuters 
Yang and Liu (1999) | [...] | none | SVM, kNN, LLSF, NB | SVM is best: 86% F-measure | 95 categories 
Joachims (1997) | 20NG | MI over TFIDF | Rocchio | 90.3% accuracy (uni-labeled) | Our baseline for 20NG 
Baker and McCallum (1998) | 20NG | Distributional clustering | Naive Bayes | 85.7% accuracy (uni-labeled) | 
Slonim and Tishby (2000) | 10 categories of 20NG | Information Bottleneck | Naive Bayes | Up to 18.4% improvement over BOW on small training sets | 
Joachims (1999) | WebKB | none | SVM | 94.2% - "course", 79.0% - "faculty", 53.3% - "project", 89.9% - "student" | Our baseline for WebKB 

Table 1: Summary of related results. 



3.1 Information Bottleneck and Distributional Clustering 

Data clustering is a challenging task in information processing and pattern recognition. The 
challenge is both conceptual and computational. Intuitively, when we attempt to cluster a dataset, our 
goal is to partition it into subsets such that points in the same subset are more "similar" to each other 
than to points in other subsets. Common clustering algorithms depend on choosing a similarity 
measure between data points and a "correct" clustering result can be dependent on an appropriate choice 
of a similarity measure. The choice of a "correct" measure must be defined relative to a particular 
application. For instance, consider a hypothetical dataset containing articles by each of two authors, 
so that half of the articles authored by each author discusses one topic, and the other half discusses 
another topic. There are two possible dichotomies of the data which could yield two different 
bi-partitions: according to the topic or according to the writing style. When asked to cluster this set 
into two sub-clusters, one cannot successfully achieve the task without knowing the goal. Therefore, 
without a suitable target at hand and a principled method for choosing a similarity measure suitable 
for the target, it can be meaningless to interpret clustering results. 

The Information Bottleneck (IB) method of Tishby, Pereira, and Bialek (1999) is a framework 
that can in some cases provide an elegant solution to this problematic "metric selection" aspect of 
data clustering. Consider a dataset given by i.i.d. observations of a random variable X. Informally, 
the IB method aims to construct a relevant encoding of the random variable X by partitioning X 
into domains that preserve (as much as possible) the Mutual Information between X and another 
"relevance" variable, Y. The relation between X and Y is made known via i.i.d. observations from 
the joint distribution P(X, Y). Denote the desired partition (clustering) of X by $\tilde{X}$. We determine $\tilde{X}$ 
by solving the following variational problem: Maximize the Mutual Information $I(\tilde{X}, Y)$ with respect 
to the partition $P(\tilde{X}|X)$, under a minimizing constraint on $I(\tilde{X}, X)$. In particular, the Information 
Bottleneck method considers the following optimization problem: Maximize 

    \[ I(\tilde{X}, Y) - \beta^{-1} I(\tilde{X}, X) \] 

over the conditional $P(\tilde{X}|X)$, where the parameter $\beta$ determines the allowed amount of reduction 
in information that $\tilde{X}$ bears on X. Namely, [...] 
minimal partition of X and the maximum preserved information on Y. Tishby et al. (1999) show 
that a solution for this optimization problem is characterized by 

    \[ P(\tilde{X}|X) = \frac{P(\tilde{X})}{Z(\beta, X)} \exp\left[ -\beta \sum_{Y} P(Y|X) \ln \frac{P(Y|X)}{P(Y|\tilde{X})} \right], \]    (2) 

where $Z(\beta, X)$ is a normalization factor, and $P(Y|\tilde{X})$ in the exponential is [...] 
(see Tishby et al., 1999, for details). The parameter $\beta$ is a Lagrange multiplier introduced for the 
constrained information, but using a thermodynamical analogy $\beta$ can also be viewed as an inverse 
temperature, and can be utilized as an annealing parameter to choose a desired cluster resolution. 

Before we continue and present the IB clustering algorithm in the next section, we comment on the 
contextual background of the IB method and its connection to "distributional clustering". Pereira, 
Tishby, and Lee (1993) introduced "distributional clustering" for distributions of verb-object pairs. 
Their algorithm clustered nouns represented as distributions over co-located verbs (or verbs 
represented as distributions over co-located nouns). This clustering routine aimed at minimizing the 
average distributional similarity (in terms of the Kullback-Leibler divergence, see Cover and Thomas, 
1991) between the conditional P(verb|noun) and the noun centroid distributions (i.e. these centroids 
are also distributions over verbs). It turned out that this routine is a special case of the more 
general IB framework. IB clustering has since been used to derive a variety of effective clustering and 
categorization routines (see, e.g., Slonim and Tishby, 2001; El-Yaniv and Souroujon, 2001; Slonim 
et al., 2002) and has interesting extensions (Friedman et al., 2001; Chechik and Tishby, 2002). We 
note also that unlike other variants of distributional clustering (such as the PLSI approach of 
Hofmann, 2001), the IB method is not based on a generative (mixture) modelling approach (including 
their assumptions) and is therefore more robust. 

3.2 Distributional Clustering via Deterministic Annealing 

Given the IB Markov chain condition $\tilde{X} \leftrightarrow X \leftrightarrow Y$ (which is not an assumption on the data; see 
Tishby et al. [...] 
consistent equations: 

    \[ P(\tilde{X}) = \sum_{X} P(X) P(\tilde{X}|X); \]    (3) 

    \[ P(Y|\tilde{X}) = \sum_{X} P(Y|X) P(X|\tilde{X}). \]    (4) 

Tishby et al. (1999) show that a solution can be obtained by starting with an arbitrary solution 
and then iterating the equations. For any value of $\beta$ this procedure is guaranteed to converge. 5 
Lower values of the $\beta$ parameter (high "temperatures") correspond to poor distributional resolution 
(i.e. fewer clusters) and higher values of $\beta$ (low "temperatures") correspond to higher resolutions 
(i.e. more clusters). 
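
The fixed-point iteration described above can be sketched in a few lines of Python for a fixed value of $\beta$ and a fixed number of clusters. This is a minimal illustration under stated assumptions rather than the authors' implementation: the random starting assignment, the fixed iteration count n_iter in place of a convergence threshold, and all function names are choices made for the example. It also assumes $\beta > 0$ and that every value of X has positive probability.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q has no mass where p does."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                return math.inf
            total += p_i * math.log(p_i / q_i)
    return total


def cluster_statistics(q, p_x, p_y_given_x):
    """Equations (3) and (4): P(x~) and P(y|x~) from q[x][t] = P(x~ = t | x)."""
    n_x, n_t, n_y = len(q), len(q[0]), len(p_y_given_x[0])
    p_t = [sum(p_x[i] * q[i][t] for i in range(n_x)) for t in range(n_t)]
    p_y_given_t = []
    for t in range(n_t):
        if p_t[t] == 0.0:
            p_y_given_t.append([0.0] * n_y)  # empty cluster
        else:
            # P(x|x~) = P(x~|x) P(x) / P(x~) by Bayes' rule
            p_y_given_t.append([
                sum(p_y_given_x[i][y] * p_x[i] * q[i][t] for i in range(n_x)) / p_t[t]
                for y in range(n_y)
            ])
    return p_t, p_y_given_t


def ib_iterate(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterate Equations (2)-(4) from a random soft assignment."""
    rng = random.Random(seed)
    p_x = [sum(row) for row in p_xy]
    p_y_given_x = [[v / p_x[i] for v in row] for i, row in enumerate(p_xy)]
    q = []
    for _ in p_xy:
        row = [rng.random() + 0.1 for _ in range(n_clusters)]
        q.append([v / sum(row) for v in row])
    for _ in range(n_iter):
        p_t, p_y_given_t = cluster_statistics(q, p_x, p_y_given_x)
        for i, p_y_i in enumerate(p_y_given_x):
            # Equation (2): P(x~|x) is proportional to P(x~) exp(-beta * KL)
            row = [p_t[t] * math.exp(-beta * kl_divergence(p_y_i, p_y_given_t[t]))
                   for t in range(n_clusters)]
            z = sum(row)
            if z > 0.0:
                q[i] = [v / z for v in row]
    p_t, p_y_given_t = cluster_statistics(q, p_x, p_y_given_x)
    return q, p_t, p_y_given_t
```

On a toy joint distribution in which two values of X favor one label and two favor the other, a large $\beta$ drives the assignments toward a two-cluster split, in line with the remark that higher values of $\beta$ correspond to higher resolution.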

Input: 
    P(X, Y) - observed joint distribution of two random variables X and Y 
    k - desired number of centroids 
    $\beta_{min}$, $\beta_{max}$ - minimal / maximal values of $\beta$ 
    $\nu > 1$ - annealing rate 
    $\delta_{conv} > 0$ - convergence threshold, $\delta_{merge} > 0$ - merging threshold 
Output: 
    Cluster centroids, given by $\{P(Y|\tilde{x}_i)\}_{i=1}^{k}$ 
    Cluster assignment probabilities, given by $P(\tilde{X}|X)$ 

Initiate $\beta \leftarrow \beta_{min}$ - current $\beta$ parameter 
Initiate $r \leftarrow 1$ - current number of centroids 
repeat 
    { 1. "EM"-like iteration: } 
    Compute $P(\tilde{X}|X)$, $P(\tilde{X})$ and $P(Y|\tilde{X})$ using Equations (2), (3) and (4) respectively 
    repeat 
        Let $P_{old}(\tilde{X}|X) \leftarrow P(\tilde{X}|X)$ 
        Compute new values for $P(\tilde{X}|X)$, $P(\tilde{X})$ and $P(Y|\tilde{X})$ using (2), (3) and (4) 
    until for each x: $\|P(\tilde{X}|x) - P_{old}(\tilde{X}|x)\| < \delta_{conv}$ 
    { 2. Merging: } 
    for all $i, j \in [1, r]$ s.t. $i < j$ and $\|P(Y|\tilde{x}_i) - P(Y|\tilde{x}_j)\| < \delta_{merge}$ do 
        Merge $\tilde{x}_i$ and $\tilde{x}_j$: $P(\tilde{x}_i|X) \leftarrow P(\tilde{x}_i|X) + P(\tilde{x}_j|X)$ 
        Let $r \leftarrow r - 1$ 
    end for 
    { 3. Centroid ghosting: } 
    for all $i \in [1, r]$ do 
        Create $\tilde{x}_{r+i}$ s.t. $\|P(Y|\tilde{x}_{r+i}) - P(Y|\tilde{x}_i)\| = \delta_{merge}$ 
        Let $P(\tilde{x}_i|X) \leftarrow \frac{1}{2} P(\tilde{x}_i|X)$, $P(\tilde{x}_{r+i}|X) \leftarrow \frac{1}{2} P(\tilde{x}_i|X)$ 
    end for 
    Let $r \leftarrow 2r$, $\beta \leftarrow \nu\beta$ 
until $r \ge k$ or $\beta > \beta_{max}$ 
If $r > k$ then merge $r - k$ closest centroids (each to its closest centroid neighbor) 



Algorithm 1: Information Bottleneck distributional clustering 

We use a hierarchical top-down clustering procedure for recovering the distributional IB clusters. 
A pseudo-code of the algorithm is given in Algorithm 1. 6 Starting with one cluster (very small 
$\beta$) that contains all the data we incrementally achieve the desired number of clusters by performing 
a process consisting of annealing stages. At each annealing stage we increment $\beta$ and attempt to 
split existing clusters. This is done by creating (for each centroid) a new "ghost" centroid at some 
random small distance from the original centroid. We then attempt to cluster the points 
(distributions) using all (original and ghost) centroids by iterating the above IB self-consistent equations, 
similar to the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). During these 
iterations the centroids are adjusted to their (locally) optimal positions and (depending on the annealing 
increment of $\beta$) some "ghost" centroids can merge back with their centroid sources. Note that in 
this scheme (as well as in the similar deterministic annealing algorithm of Rose, 1998), one has to 
use an appropriate annealing rate in order to identify phase transitions which correspond to cluster 
splits. 

5. This procedure is analogous to the Blahut-Arimoto algorithm in Information Theory (Cover and Thomas, 1991). 

6. A similar annealing procedure, known as deterministic annealing, was introduced in the context of clustering by 
Rose (1998). 

An alternative agglomerative (bottom-up) hard-clustering IB algorithm was developed by Slonim 
and Tishby (2000). This algorithm generates hard clustering of the data and thus approximates the 
above IB clustering procedure. Note that the time complexity of this algorithm is $O(n^2)$, where n is 
the number of data points (distributions) to be clustered (see also an approximate faster 
agglomerative procedure by Baker and McCallum, 1998). 

The application of the IB clustering algorithm in our context is straightforward. The variable X 
represents words that appear in training documents. The variable Y represents class labels and thus 
the joint distribution P(X, Y) is characterized by pairs (w, c), where w is a word and c is the class 
label of the document where w appears. Starting with the observed conditionals $\{P(Y = c|X = w)\}_c$ 
(giving for each word w its class distribution) we cluster these distributions using Algorithm 1. 
For a pre-specified number of clusters k the output of Algorithm 1 is: (i) k centroids, given by the 
distributions $\{P(\tilde{X} = \tilde{w}|X = w)\}_{\tilde{w}}$ for each word w, where $\tilde{w}$ are the word centroids (i.e. there are 
k such word centroids which represent k word clusters); (ii) Cluster assignment probabilities given 
by $P(\tilde{X}|X)$. Thus, each word w may (partially) belong to all k clusters and the association weight of 
w to the cluster represented by the centroid $\tilde{w}$ is $P(\tilde{w}|w)$. 

The time complexity of Algorithm 1 is $O(c_1 c_2 m n)$, where $c_1$ is an upper limit on the number 
of annealing stages, $c_2$ is an upper limit on the number of convergence stages, m is the number of 
categories and n is the number of data points to cluster. 

In Table 2 we provide an example of the output of Algorithm 1 applied to the 20NG corpus 
(see Section 4.2) with both k = 300 and k = 50 cluster centroids. For instance, we see that 
$P(\tilde{w}_4|\text{attacking}) = 0.99977$ and $P(\tilde{w}_1|\text{attacking}) = 0.000222839$. Thus, the word "attacking" mainly 
belongs to cluster $\tilde{w}_4$. As can be seen, all the words in the table belong to a single cluster or mainly 
to a single cluster. With values of k in this range this behavior is typical to most of the words in 
this corpus (the same is also true for the Reuters and WebKB datasets). Only a small fraction of 
less than [...] 
$50 \le k \le 500$. It is also interesting to note that IB clustering often results in word stemming. For 
instance, "atom" and "atoms" belong to the same cluster. Moreover, contextually synonymous words 
are often assigned to the same cluster. For instance, many "computer words" such as "computer", 
"hardware", "ibm", "multimedia", "pc", "processor", "software", "8086" etc. compose the bulk of 
one cluster. 

3.3 Support Vector Machines (SVMs) 

The Support Vector Machine (SVM) (Boser et al., 1992; Schölkopf and Smola, 2002) is a strong 
inductive learning scheme that enjoys considerable theoretical and empirical support. As noted in 
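Word | Clustering to 300 clusters | Clustering to 50 clusters 
-----|----------------------------|-------------------------- 
at | $\tilde{w}_{97}$ (1.0) | $\tilde{w}_{44}$ (0.996655), [...] (0.00334415) 
ate | $\tilde{w}_{205}$ (1.0) | $\tilde{w}_{42}$ (1.0) 
atheism | [...] (1.0) | [...] (1.0) 
atheist | [...] (1.0) | [...] (1.0) 
atheistic | [...] | $\tilde{w}_{3}$ (1.0) 
atheists | $\tilde{w}_{76}$ (1.0) | $\tilde{w}_{3}$ (1.0) 
atmosphere | $\tilde{w}_{200}$ (1.0) | $\tilde{w}_{33}$ (1.0) 
atmospheric | [...] | $\tilde{w}_{33}$ (1.0) 
atom | $\tilde{w}_{92}$ (1.0) | $\tilde{w}_{13}$ (1.0) 
atomic | $\tilde{w}_{92}$ (1.0) | $\tilde{w}_{35}$ (1.0) 
atoms | $\tilde{w}_{92}$ (1.0) | $\tilde{w}_{13}$ (1.0) 
atone | [...] (1.0) | $\tilde{w}_{14}$ (0.998825), $\tilde{w}_{13}$ (0.00117386) 
atonement | $\tilde{w}_{271}$ (1.0) | [...] (1.0) 
atrocities | $\tilde{w}_{4}$ (0.99977), $\tilde{w}_{1}$ (0.000222839) | $\tilde{w}_{5}$ (1.0) 
attached | $\tilde{w}_{251}$ (1.0) | $\tilde{w}_{30}$ (1.0) 
attack | $\tilde{w}_{71}$ (1.0) | $\tilde{w}_{28}$ (1.0) 
attacked | $\tilde{w}_{4}$ (0.99977), $\tilde{w}_{1}$ (0.000222839) | $\tilde{w}_{10}$ (1.0) 
attacker | $\tilde{w}_{103}$ (1.0) | $\tilde{w}_{23}$ (1.0) 
attackers | $\tilde{w}_{4}$ (0.99977), $\tilde{w}_{1}$ (0.000222839) | $\tilde{w}_{5}$ (1.0) 
attacking | $\tilde{w}_{4}$ (0.99977), $\tilde{w}_{1}$ (0.000222839) | $\tilde{w}_{10}$ (1.0) 
attacks | $\tilde{w}_{71}$ (1.0) | $\tilde{w}_{38}$ (1.0) 
attend | [...] (1.0) | [...] (1.0) 
attorney | $\tilde{w}_{91}$ (1.0) | $\tilde{w}_{28}$ (1.0) 
attribute | $\tilde{w}_{263}$ (1.0) | $\tilde{w}_{22}$ (1.0) 
attributes | $\tilde{w}_{263}$ (1.0) | $\tilde{w}_{22}$ (1.0) 

Table 2: A clustering example of 20NG words. $\tilde{w}_i$ are centroids to which the words "belong"; the 
centroid weights are shown in the brackets. 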



Section 2 there is much empirical support for using SVMs for text categorization (Joachims, 2001; 
Dumais et al., 1998, etc.). 

Informally, for linearly separable two-class data, the (linear) SVM computes the maximum margin 
hyperplane that separates the classes. For non-linearly separable data there are two possible 
extensions [...] 
cost parameter. The second solution is obtained by implicitly embedding the data into a high (or 
infinite) dimensional [...] 
hyperplane is sought in this high-dimensional space. A combination of both approaches (soft margin 
and embedding) is often used. 

The SVM computation of the (soft) maximum margin is posed as a quadratic optimization 
problem that can be solved in time complexity of $O(kn^2)$, where n is the training set size and k is the 
dimension of each point (number of features). Thus, when applying SVM for text categorization of 
large datasets an efficient representation of the text can be of major importance. 

SVMs are well covered by numerous papers, books and tutorials and therefore we suppress 
further descriptions here. Following Joachims (2001) and Dumais et al. (1998) we use a linear 
SVM in all our experiments. The implementation we use is SVMlight of Joachims. 7 

3.4 Putting it All Together 

For handling m-class categorization problems (m > 2) we choose (for both the uni-labeled and 
multi-labeled settings) a straightforward decomposition into m binary problems. Although this 
decomposition is not the best for all datasets (see, e.g., Allwein et al., 2000; Fürnkranz, 2002) it 
allows for a direct comparison with the related results (which were all achieved using this 
decomposition as well, see Section 2). Thus, for a categorization problem into m classes we construct m 
[...] 
multi-labeled categorization (see Section 5.1) experiments we construct for each category a "hard" 
[...] 
it. On the other hand, in uni-labeled experiments we construct for each category a confidence-rated 
SVM that outputs for [...] 

A major goal of our work is to compare two categorization schemes based on the two 
representations: the simple BOW representation together with Mutual Information feature selection 
(called here BOW+MI) and a representation based on word clusters computed via IB distributional 
clustering (called here IB). 

We first [...] 
m categories, for each category c, a binary confidence-rated linear SVM classifier is trained using 
the following procedure: The k most discriminating words are selected according to the Mutual 
Information between the word w and the category c (see Equation (1)). Then each training document 
of category c is projected over the corresponding k "best" words and for each category c a dedicated 
classifier $h_c$ is trained to separate c from the other categories. For categorizing a new (test) document 
d, for each category c we project d over the k most discriminating words of category c. Denoting a 
projected document d by $d_c$, we compute $h_c(d_c)$ for all categories c. The category attributed for d 
is $\arg\max_c h_c(d_c)$. For multi-labeled categorization the same procedure is applied except that now 
we train, for each category c, a hard (non-confidence-rated) [...] $h_c$ and the subset of categories 
attributed for a test document d is $\{c : h_c(d_c) = 1\}$. 
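
The two decision rules above can be illustrated with a short Python sketch. It assumes that each per-category classifier is a linear function of the projected document, that the confidence-rated output is the raw linear score, and that the hard output is 1 exactly when that score is positive. These conventions, together with the function names and the layout of the models dictionary, are assumptions for the example and are not taken from the SVMlight interface.

```python
def project(doc_counts, selected_words):
    """Project a document (word -> count) onto one category's selected words."""
    return [doc_counts.get(word, 0) for word in selected_words]


def linear_score(weights, bias, features):
    """Confidence-rated output of a linear classifier (assumed: raw score)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def categorize_uni(doc_counts, models):
    """Uni-labeled rule: return argmax_c h_c(d_c).

    models maps category -> (selected_words, weights, bias).
    """
    def score(category):
        words, weights, bias = models[category]
        return linear_score(weights, bias, project(doc_counts, words))

    return max(models, key=score)


def categorize_multi(doc_counts, models):
    """Multi-labeled rule: return {c : h_c(d_c) = 1}, with h_c = 1 iff score > 0."""
    labels = set()
    for category, (words, weights, bias) in models.items():
        if linear_score(weights, bias, project(doc_counts, words)) > 0:
            labels.add(category)
    return labels
```

Because every category has its own list of selected words, the same document is projected differently for each classifier before it is scored.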

The structure of the IB categorization scheme is similar (in both the uni-labeled and multi- 
labeled settings) but now the representation of a document consists of vectors of word cluster 
counts corresponding to a cluster mapping (from words to cluster centroids) that is computed for 
all categories simultaneously using the Information Bottleneck distributional clustering procedure 
(Algorithm 1). 
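
A vector of word cluster counts can be built from a bag-of-words document roughly as in the Python sketch below. It assumes that each word is mapped to the single cluster with the largest assignment weight, which matches the observation that most words belong entirely or mainly to one cluster; the hard mapping, the function names and the data layout are assumptions of this example rather than details given in the text.

```python
def hard_mapping(assignments):
    """Map each word to the index of its highest-weight cluster.

    assignments maps word -> list of cluster assignment weights.
    """
    return {
        word: max(range(len(weights)), key=weights.__getitem__)
        for word, weights in assignments.items()
    }


def cluster_counts(doc_counts, word_to_cluster, n_clusters):
    """Sum a document's word counts within each word cluster."""
    vector = [0] * n_clusters
    for word, count in doc_counts.items():
        cluster = word_to_cluster.get(word)
        if cluster is not None:  # words outside the clustered vocabulary are skipped
            vector[cluster] += count
    return vector
```

With k clusters the resulting vector has k entries regardless of the vocabulary size, which is what makes the representation compact.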



7. The SVMlight software can be downloaded at: http://svmlight.joachims.org/. 
4. Datasets 

Three benchmark datasets - Reuters-21578, 20 Newsgroups and WebKB - were experimented with 
in our application of feature selection for text categorization. In this section we describe these 
datasets and the preprocessing that was applied to them. 

4.1 Reuters-21578 

The Reuters-21578 corpus contains 21578 articles taken from the Reuters newswire. 8 Each article is 
typically designated into one or more semantic categories such as "earn", "trade", "corn" etc., where 
the total number of categories is 114. We used the ModApte split, which consists of a training set 
of 7063 articles and a test set of 2742 articles. 9 

In both the training and test sets we preprocessed each article so that any additional information 
except for the title and the body was removed. In addition, we lowered the case of letters. Following 
Dumais et al. (1998) we generated distinct features for words that appear in article titles. In the 
IB-based setup [...] 
that appear in $W_{low\_freq}$ articles or less, where $W_{low\_freq}$ is determined using cross-validation (see 
Section 5.2) [...] 
since these [...] 

4.2 20 Newsgroups 

The 20 Newsgroups (20NG) corpus contains 19997 articles taken from the Usenet newsgroups 
collection. 10 Each article is designated into one or more semantic categories and the total number 
of categories is 20, all of about the same size. Most of the articles have only one 
semantic label, while about 4.5% of the articles have two or more labels. Following Schapire and 
Singer (2000) we use [...] 
and to remove duplications. We preprocessed each article so that any additional information except 
for the subject [...] 
of binary [...] 
"binary" (or a delimiter) if it is longer than 50 symbols and contains no blanks. Overall we removed 
23057 such lines (where most of these occurrences appeared in a dozen of articles overall). Also, 
we lowered the case [...] 
on low-frequency words, using the parameter $W_{low\_freq}$ determined via cross-validation. 
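
The rule for detecting "binary" lines stated above is simple enough to express directly. The Python sketch below treats a "blank" as a space character and a "symbol" as any character, both of which are assumptions; the function names are hypothetical.

```python
def is_binary_line(line):
    """True if the line is longer than 50 symbols and contains no blanks."""
    return len(line) > 50 and " " not in line


def strip_binary_lines(text):
    """Remove all lines of an article that the rule marks as "binary"."""
    return "\n".join(line for line in text.splitlines() if not is_binary_line(line))
```

A line of exactly 50 characters is kept, since the rule requires a line to be longer than 50 symbols.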

4.3 WebKB: World Wide Knowledge Base 

The World Wide Knowledge Base dataset (WebKB) 11 is a collection of 8282 web pages obtained 
from four academic domains. The WebKB was collected by Craven et al. (1998). The web pages 
in the WebKB set [...] 
and the second [...] 

8. Reuters-21578 can be found at: http://www.daviddlewis.com/resources/testcollections/reuters21578/. 
9. Note that in these figures we count documents with at least one label. The original split contains 9603 training 
documents and 3299 test documents where the additional articles have no labels. While in practice it may be possible 
to utilize additional unlabeled documents for improving performance using semi-supervised learning algorithms (see, 
e.g., El-Yaniv and Souroujon, 2001), in this work we simply discarded these documents. 
10. The 20 Newsgroups can be found at: http://kdd.ics.uci.edu/databases/[...] 
11. WebKB can be found at: http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/. 
[...]chotomy, which consists of 7 categories: course, department, faculty, project, staff, student and
other. Following Nigam et al. (1998) we discarded the categories other, 12 department and staff.
Table 3 lists the 4 remaining categories and their sizes.

Category    Number of articles    Proportion (%)
course      930                   22.1
faculty     1124                  26.8
project     504                   12.0
student     1641                  39.1

Table 3: Some essential details of WebKB categories.

Since the web pages are in HTML format, they contain much non-textual information: HTML
tags, [...]
instance, in some documents anchor-texts of URLs are the only discriminative textual information.
We did not [...]
in the IB-based setup we applied a filter on low-frequency words, using the parameter $W_{low\_freq}$
(determined via cross-validation).

5. Experimental Setup

This section presents our experimental model, starting with a short overview of the evaluation meth- 
ods we used. 

5.1 Optimality Criteria and Performance Evaluation

We are given a training set $D_{train} = \{(d_1, \ell_1), \ldots, (d_n, \ell_n)\}$ of labeled text documents, where each
document $d_i$ belongs to a document set $D$ and the label $\ell_i = \ell(d_i)$ of $d_i$ is within a predefined set
of categories $C = \{c_1, \ldots, c_m\}$. In the multi-labeled version of text categorization, a document can
belong to several classes simultaneously. That is, both $h(d)$ and $\ell(d)$ can be sets of categories rather
than single categories. In the case where each document has only a single label we say that the
categorization is uni-labeled.

We measure the empirical effectiveness of multi-labeled text categorization in terms of the classical
information retrieval parameters of "precision" and "recall" (Baeza-Yates and Ribeiro-Neto,
1999). Consider a multi-labeled categorization problem with $m$ classes, $C = \{c_1, \ldots, c_m\}$. Let $h$ be
a classifier. For a document $d$, let $h(d) \subseteq C$ be the set of categories
designated by $h$ for $d$. Let $\ell(d) \subseteq C$ be the true categories of $d$. Let $D_{test} \subset D$ be a test set of "unseen"
documents that were not used in the construction of $h$. For each category $c_i$ define the following
quantities:

$$TP_i = \sum_{d \in D_{test}} I[c_i \in \ell(d) \wedge c_i \in h(d)],$$
$$FP_i = \sum_{d \in D_{test}} I[c_i \notin \ell(d) \wedge c_i \in h(d)],$$
$$FN_i = \sum_{d \in D_{test}} I[c_i \in \ell(d) \wedge c_i \notin h(d)],$$

12. Note however that other is the largest category in WebKB and constitutes about 45% of this set.
where $I[\cdot]$ is the indicator function. For example, $FP_i$ (the "false positives" with respect to $c_i$) is the
number of documents categorized by $h$ into $c_i$ whose true set of labels does not include $c_i$, etc. For
each category $c_i$ we now define the precision $P_i = P_i(h)$ of $h$ and the recall $R_i = R_i(h)$ with respect
to $c_i$ as $P_i = \frac{TP_i}{TP_i + FP_i}$ and $R_i = \frac{TP_i}{TP_i + FN_i}$. The overall micro-averaged precision $P = P(h)$ and recall
$R = R(h)$ of $h$ is a weighted average of the individual precisions and recalls (weighted with respect
to the sizes of the test set categories). That is, $P = \frac{\sum_{i=1}^{m} TP_i}{\sum_{i=1}^{m} (TP_i + FP_i)}$ and $R = \frac{\sum_{i=1}^{m} TP_i}{\sum_{i=1}^{m} (TP_i + FN_i)}$. Due to the
natural tradeoff between precision and recall, the following two quantities are often used in order to
measure the performance:

• F-measure: The harmonic mean of precision and recall; that is $F = \frac{2}{1/P + 1/R}$.

• Break-Even Point (BEP): When a classifier allows the tradeoff between precision and recall
to be adjusted, the value of $P$ (and $R$) satisfying $P = R$ is
called the break-even point (BEP). Since it is time consuming to evaluate the exact value of
the BEP it is customary to estimate it using the arithmetic mean of P and R.
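The micro-averaged quantities above can be computed directly from true and predicted label sets. The following is a minimal sketch under stated assumptions: the function name and the set-based input format are hypothetical, and the false-negative counter follows the standard definition (documents whose true labels include a category that the classifier did not assign).

```python
def micro_averaged_scores(true_sets, predicted_sets, categories):
    """Micro-averaged precision, recall, F-measure and the estimated BEP.

    true_sets / predicted_sets: one set of category names per test document.
    Counts are pooled over all categories before dividing (micro-averaging).
    """
    tp = fp = fn = 0
    for truth, predicted in zip(true_sets, predicted_sets):
        for c in categories:
            in_truth = c in truth
            in_pred = c in predicted
            if in_truth and in_pred:
                tp += 1          # true positive for category c
            elif in_pred:
                fp += 1          # assigned c, but c is not a true label
            elif in_truth:
                fn += 1          # c is a true label, but was not assigned
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # The BEP is estimated by the arithmetic mean of P and R.
    bep_estimate = (precision + recall) / 2
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "bep_estimate": bep_estimate}


if __name__ == "__main__":
    truth = [{"earn"}, {"acq", "trade"}, {"trade"}]
    predicted = [{"earn"}, {"acq"}, {"earn"}]
    print(micro_averaged_scores(truth, predicted, ["earn", "acq", "trade"]))
```

In the toy data there are 2 true positives, 1 false positive and 2 false negatives, so precision is 2/3 and recall is 1/2.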

The above performance measures concern multi-labeled categorization. In a uni-labeled categorization
the accepted performance measure is accuracy, the fraction of correctly labeled
documents in $D_{test}$. Specifically, when both $h(d)$ and $\ell(d)$ are singletons (i.e. uni-labeling),
the accuracy $Acc(h)$ of $h$ is $Acc(h) = \frac{1}{|D_{test}|} \sum_{d \in D_{test}} I[h(d) = \ell(d)]$. It is not hard to see that in this
case the accuracy equals the precision and recall (and the estimated break-even point).

Following Dumais et al. (1998) (and for comparison with this work), in our multi-labeled experiments
(Reuters and 20NG) we report on micro-averaged break-even point (BEP) results. In
our uni-labeled experiments (20NG and WebKB) we report on accuracy. Note that we experiment 
with both uni-labeled and multi-labeled categorization of 20NG. Although this set is in general 
multi-labeled, the proportion of multi-labeled articles in the dataset is rather small (about 4.5%) and 
therefore a uni-labeled categorization of this set is also meaningful. To this end, we follow Joachims
(1997) and consider our (uni-labeled) categorization of a test document to be correct if the label we 
assign to the document belongs to its true set of labels. 
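The correctness criterion borrowed from Joachims (1997) can be sketched as follows: a single predicted label counts as correct when it belongs to the document's true label set. The function name and input format are hypothetical.

```python
def unilabeled_accuracy(predicted_labels, true_label_sets):
    """Fraction of documents whose single predicted label is among the true labels."""
    if not predicted_labels:
        raise ValueError("no test documents")
    correct = sum(1 for label, truth in zip(predicted_labels, true_label_sets)
                  if label in truth)
    return correct / len(predicted_labels)


if __name__ == "__main__":
    # The second document is multi-labeled; predicting either true label is correct.
    print(unilabeled_accuracy(["a", "b", "c"], [{"a"}, {"a", "b"}, {"a"}]))
```

When every true label set is a singleton, this reduces to the ordinary accuracy defined earlier in the section.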

In order to better estimate the performance of our algorithms on test documents we use standard 
cross-validation estimation in our experiments with 20NG and WebKB. However, when experi- 
menting with Reuters, for compatibility with the experiments of Dumais et al. we use its standard 
ModApte split (i.e. without cross-validation). In particular, in both 20NG and WebKB we use 4-fold
cross-validation where we randomly and uniformly split each category into 4 folds and take
three folds for training and one fold for testing. Note that this 3/4:1/4 split is proportional to the
training to test set size ratios of the ModApte split of Reuters. In the cross-validated experiments 
we always report on the estimated average (over the 4 folds) performance (either BEP or accuracy), 
estimated standard deviation and standard error of the mean. 
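The per-category 4-fold split described above can be sketched as follows. This is an illustration under assumptions, not the authors' code: each document is taken to have one category label for the purpose of splitting, labels are strings, and the function names and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=4, seed=0):
    """Split document indices into n_folds folds, category by category.

    labels: one category name per document (assumption: a single label is
    used for splitting, even if the document is multi-labeled).
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, label in enumerate(labels):
        by_category[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for category in sorted(by_category):
        indices = by_category[category]
        rng.shuffle(indices)                       # random order within the category
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)  # deal out evenly
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists: one fold for testing, the rest for training."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    labels = ["course"] * 8 + ["faculty"] * 4
    for train, test in train_test_splits(stratified_folds(labels)):
        print(len(train), len(test))
```

Dealing the shuffled indices out round-robin keeps each category's share roughly equal across the folds.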

5.2 Hyperparameter Optimization 

A major issue when working with SVMs (and in fact with almost all inductive learning algorithms)
is parameter tuning. As noted earlier (in Section 3.3), we used linear SVMlight in our implementation.
The only relevant parameters for the linear kernel we use are C (trade-off between training
error and margin) and J (cost-factor, by which training errors on positive examples outweigh errors
on negative examples). We optimize these parameters using a validation set that consists of one
third of the three-fold training set. 13 For each of these parameters we fix a set of feasible
values 14 and in general, we attempt to test performance (over the validation set) using all possible
combinations of parameter values over the feasible sets.
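The exhaustive search over the feasible sets can be sketched as a plain grid search. The `evaluate` callback is hypothetical: it stands for "train with (C, J) on the training part and return performance on the validation set". The value lists are an assumption based on one reading of footnote 14 (four powers of ten for C; 0.5 and the integers 1 to 10 for J) and should be checked against the original paper.

```python
from itertools import product

# Assumed feasible sets (a reading of footnote 14, not confirmed values).
C_VALUES = [1e-4, 1e-3, 1e-2, 1e-1]
J_VALUES = [0.5] + list(range(1, 11))


def grid_search(evaluate, c_values=C_VALUES, j_values=J_VALUES):
    """Try every (C, J) combination; return (best_score, best_c, best_j).

    evaluate(c, j) is a caller-supplied function returning validation
    performance (higher is better). Ties keep the first combination seen.
    """
    best = None
    for c, j in product(c_values, j_values):
        score = evaluate(c, j)
        if best is None or score > best[0]:
            best = (score, c, j)
    return best


if __name__ == "__main__":
    # Toy objective with a known optimum at C = 1e-2, J = 2.
    toy = lambda c, j: -abs(c - 1e-2) - abs(j - 2)
    print(grid_search(toy))
```

With these assumed sets the grid has 44 combinations per classifier, which is why tuning all binary classifiers jointly becomes expensive as the number of categories grows.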

Note that tuning the parameters C and J is different in the multi-labeled and uni-labeled settings.
In the multi-labeled setting [...] independently of the other classifiers [...]
we use the max-win decomposition, the categorization of a document is dependent on all the binary
classifiers [...] classifier can generate confidence rates that are maximal for all the documents, which results in extremely
poor performance. Therefore, a global tuning of all the binary classifiers is necessary. Nevertheless,
in the c[...]
time-consuming and, ideally, a clever search in this high dimensional parameter space should be
considered. Instead, we simply used the information we have on the 20NG categories to reduce the
size of the parameter space. Specifically, [...]
correlated ones and we split the list of the categories into 9 groups as in Table 4. 15 For each group
the parameters are tuned together and independently of other groups. This way we achieve an approximately
global parameter tuning also on the 20NG set. Note that the (much) smaller size of
WebKB (both number of categories and number of documents) allows for global parameter tuning
over the feasible parameter value sets without any need for approximation.



Group   Content
1       (a) talk.religion.misc; (b) soc.religion.christian; (c) alt.atheism
2       (a) rec.sport.hockey; (b) rec.sport.baseball
3       (a) talk.politics.mideast
4       (a) sci.med; (b) talk.politics.guns; (c) talk.politics.misc
5       (a) rec.autos; (b) rec.motorcycles; (c) sci.space
6       (a) comp.os.ms-windows.misc; (b) comp.graphics; (c) comp.windows.x
7       (a) sci.electronics; (b) comp.sys.mac.hardware; (c) comp.sys.ibm.pc.hardware
8       (a) sci.crypt
9       (a) misc.forsale

Table 4: A split of the 20NG's categories into thematic groups.

In IB categorization also the parameter $W_{low\_freq}$ (see Section [...]) [...]
categorization we search for both the SVM parameters and $W_{low\_freq}$. To reduce the time complexity we
employ the following [...] and J and then, using



13. Dumais et al. (1998) also use a 1/3 random subset of the training set for validated parameter tuning.

14. Specifically, for the C parameter the feasible set is $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$ and for J it is $\{0.5, 1, 2, \ldots, 10\}$.

15. It is important to note that an almost identical split can be computed in a completely unsupervised manner using the
Multivariate Information Bottleneck (see Friedman et al., 2001, for further details).
the validation set, we optimize $W_{low\_freq}$. 16 After determining $W_{low\_freq}$ we tune both C and J as
described above. 17

5.3 Fair vs. Unfair Parameter Tuning 

In our experiments with the BOW+MI and IB categorizers we sometimes perform unfair parameter
tuning in which we tune the SVM parameters over the test set (rather than the validation set). If
a categorizer A achieves better performance than a categorizer B while B's parameters were tuned
unfairly (and A's parameters were tuned fairly) then we can get stronger evidence that A performs
better than B. In our experiments we sometimes use this technique to accentuate differences between
two categorizers.

6. Categorization Results 

We compare text categorization results of the IB and BOW+MI settings. For compatibility with the
original BOW+MI setting of Dumais et al. (1998), where the number of best discriminating words
k is set to 300, we report on results with k = 300 for both settings. In addition, we show BOW+MI
results with k = 15,000, which is an example of a large value of k that led to good categorization
results in the tests we performed. We also report on BOW results without applying MI feature
selection.

6.1 Multi-Labeled Categorization

Table 5 summarizes the multi-labeled categorization results obtained by the two categorization 
schemes (BOW+MI and IB) over Reuters (10 largest categories) and 20NG datasets. Note that 
the 92.0% BEP result for BOW+MI over Reuters was established by Dumais et al. (1998). 18 To the
best of our knowledge, [...]
labeled categorization of this dataset. Previous attempts at multi-labeled categorization of this set
were performed by Schapire and Singer (2000), but no overall result on the entire set was reported.

On 20NG the advantage of the IB categorizer over BOW+MI is striking when k = 300 words
(and k = 300 word clusters) are used. Note that the 77.7% BEP of BOW+MI is obtained using
unfair parameter tuning (see Section 5.3). However, this difference is not sustained when we use
k = 15,000 words. Using [...]
increases to 86.3% (again, using unfair parameter tuning), which taking into account the statistical
deviations is similar to the IB BEP performance. The BOW+MI results that are achieved with fair
parameter tuning show an increase in the gap between the performance of the two methods. Nevertheless,
the IB categorizer achieves this BEP performance using only 300 features (word clusters),
almost two orders of magnitude smaller than 15,000. Thus, with respect to 20NG, the IB categorizer
outperforms [...]

We also tried other values of the k parameter, where 300 < k < 15,000 and k > 15,000. We found

16. The set of feasible $W_{low\_freq}$ values we use is {0, 2, 4, 6, 8}.

17. The "optimal" determined value of $W_{low\_freq}$ for Reuters is 4, for WebKB (across all folds) it is 8 and for 20NG it
is 0. The number of distinct words after removing low-frequency words is: 9,953 for Reuters ($W_{low\_freq} = 4$), about
110,000 for 20NG ($W_{low\_freq} = 0$) and about 7,000 for WebKB ($W_{low\_freq} = 8$), depending on the fold.

18. This result was achieved using binary BOW representation, see Section 2. We replicated Dumais et al.'s experiment
and in fact obtained a slightly higher BEP result of 92.3%.
Categorizer          Reuters (BEP)                             20NG (BEP)
BOW+MI, k = 300      92.0 (obtained by Dumais et al. (1998))   76.5 ± 0.4 (0.25)
                                                               77.7 ± 0.5 (0.31) unfair
BOW+MI, k = 15000    92.0                                      85.6 ± 0.6 (0.35)
                                                               86.3 ± 0.5 (0.27) unfair
BOW                  89.7                                      86.5 ± 0.4 (0.26) unfair
IB, k = 300          91.2                                      88.6 ± 0.3 (0.21)
                     92.6 unfair

Table 5: Multi-labeled categorization BEP results for 20NG and Reuters. k is the number of selected
words or word-clusters. All 20NG results are averages of 4-fold cross-validation.
Standard deviations are given after the "±" symbol and standard errors of the means are
given in brackets. "Unfair" indicates unfair parameter tuning over the test sets (see Section 5.3).



that the learning curve, as a function of k, is monotone increasing until it reaches a plateau around
k = 15,000.

We repeat the same experiment over the Reuters dataset but there we obtain different results.
Now the IB categorizer loses its BEP advantage and achieves a 91.2% BEP, 19 a slightly inferior
(but quite similar) performance to the BOW+MI categorizer (as reported by Dumais et al., 1998).
Note that the BOW+MI [...]
k = 15,000. Furthermore, using all features led to a decrease of 2% in BEP.



Categorizer          WebKB (Accuracy)             20NG (Accuracy)
BOW+MI, k = 300      92.6 ± 0.3 (0.20)            84.7 ± 0.7 (0.41)
                                                  85.5 ± 0.7 (0.45) unfair
BOW+MI, k = 15000    92.4 ± 0.5 (0.32)            90.2 ± 0.3 (0.17)
                                                  90.9 ± 0.2 (0.12) unfair
BOW                  92.3 ± 0.5 (0.40)            91.2 ± 0.1 (0.08) unfair
IB, k = 300          89.5 ± 0.7 (0.41)            91.3 ± 0.4 (0.24)
                     91.0 ± 0.5 (0.32) unfair

Table 6: Uni-labeled categorization accuracy for 20NG and WebKB. k is the number of selected
words or word-clusters. All accuracies are averages of 4-fold cross-validation. Standard
deviations are given after the "±" symbol and standard errors of the means are given in
brackets. "Unfair" indicates unfair parameter tuning over the test sets (see Section 5.3).



19. Using unfair parameter tuning the IB categorizer achieves 92.6% BEP.
6.2 Uni-Labeled Categorization

We also perform uni-labeled categorization experiments using the BOW+MI and IB categorizers
over 20NG [...]
qualitatively similar to the multi-labeled results presented above with WebKB replacing Reuters.
Here again, over the 20NG set, the IB categorizer shows a clear accuracy advantage over
BOW+MI with k = 300 and this advantage is diminished if we take k = 15,000. On the other
hand, we observe a comparable (and similar) accuracy of both categorizers over WebKB, and as it
is with Reuters, [...]
set size.

The use of k = 300 word clusters in the IB categorizer is not necessarily optimal. We also
performed this categorization experiment with different values of k ranging from 100 to 1000. The
categorization accuracy slightly increases when k moves from [...]
change when k > 200.

The categorization results reported above show that the performance of the BOW+MI categorizer
and the IB categorizer is sensitive to the dataset being categorized. What makes the performance of
these two categorizers different over different datasets? Why does the more sophisticated IB categorizer
outperform the BOW+MI categorizer (with either higher accuracy or better representation
efficiency) over [...]
attempt to identify differences between these corpora that can account for this behavior.

One possible approach to quantify the complexity of a corpus with respect to a categorization
system is to observe and analyze learning curves plotting the performance of the categorizer as a
function of the number of words selected for representing each category. Before presenting such
learning curves for the three corpora, we focus on the extreme case where we categorize each
of the corpora using only the three top words per category (where top-scores are measured using
the Mutual Information of words with respect to categories). Tables 7, 8 and 9 specify (for each
corpus) a list of the top three words for each category, together with the performance achieved by
the BOW+MI [...] performance of BOW+MI [...]
corpus). For instance, observing Table 7, computed for Reuters, we see that based only on the words
"vs", "cts" and "loss" it is possible to achieve 93.5% BEP when categorizing the category earn. We
note that the word "vs" appears in 87% of the articles of the category earn (i.e., in 914 articles
among the total 1044 of this category). This word appears in only 15 non-earn articles in the test set
and therefore "vs" can, by itself, categorize earn with very high precision. 20 This phenomenon was
already noticed by [...]
can lead to extremely high accuracy when distinguishing between the Reuters category wheat and
the other categories (within a uni-labeled setting). 21 The difference between the 20NG and the two
other corpora is striking when considering the relative improvement in categorization quality when
increasing the feature set up to 15,000 words. While one can dramatically improve categorization

20. In the training set the word "vs" appears in 1900 of the 2709 earn articles (70.1%) and only in 14 of the 4354 non-earn
articles (0.3%).

21. When using only one word per category, we observed a 74.6% BEP when categorizing Reuters (10 largest categories),
66.3% accuracy when categorizing WebKB and 34.6% accuracy when categorizing 20NG.
of 20NG by over 150% with many more words, we observe a relative improvement of only about
15% and 26% in the case of Reuters and WebKB, respectively.
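Ranking words by Mutual Information with respect to a category can be sketched as follows. This section does not spell out the exact MI index used, so the sketch assumes the textbook mutual information between two binary variables (word present or absent, document in the category or not), estimated from document counts; the function names are hypothetical. Under this definition a word can score highly because its absence predicts the category, which matches the "+" and "-" marks in Table 7.

```python
import math


def mutual_information(docs, labels, word, category):
    """MI (in bits) between 'word occurs in document' and 'document is in category'.

    docs: list of sets of words; labels: list of sets of categories.
    Probabilities are plain relative frequencies over the documents.
    """
    n = len(docs)
    counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for words, cats in zip(docs, labels):
        counts[(int(word in words), int(category in cats))] += 1
    mi = 0.0
    for (x, y), n_xy in counts.items():
        if n_xy == 0:
            continue                      # 0 * log(0) is taken as 0
        p_xy = n_xy / n
        p_x = (counts[(x, 0)] + counts[(x, 1)]) / n
        p_y = (counts[(0, y)] + counts[(1, y)]) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def top_words(docs, labels, category, k=3):
    """The k words with the highest MI for the category (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    return sorted(vocabulary,
                  key=lambda w: (-mutual_information(docs, labels, w, category), w))[:k]


if __name__ == "__main__":
    docs = [{"vs", "cts"}, {"vs", "loss"}, {"oil"}, {"oil", "bpd"}]
    labels = [{"earn"}, {"earn"}, {"crude"}, {"crude"}]
    print(top_words(docs, labels, "earn", k=2))
```

In the toy data "vs" marks earn by its presence and "oil" marks earn by its absence; both reach the maximum of one bit.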



[Figure 1: two panels, (a) and (b), each plotting performance against the number of features (up to 10 in (a), up to 300 in (b)), with one curve each for Reuters, 20 newsgroups and WebKB.]

Figure 1: Learning curves (BEP or accuracy vs. number of words) for the datasets: Reuters-21578
(multi-labeled, BEP), 20NG (uni-labeled, accuracy) and WebKB (uni-labeled, accuracy)
over the MI-sorted top 10 words (a) and the top 300 words (b) using the BOW+MI categorizer.



Category   1st word   2nd word      3rd word    BEP on 3 words   BEP on 15000 words   Relative Improvement
earn       vs+        cts+          loss+       93.5%            98.6%                5.4%
acq        shares+    vs-           Inc+        76.3%            95.2%                24.7%
money-fx   dollar+    [illegible]   exchange+   53.8%            80.5%                49.6%
grain      wheat+     tonnes+       grain+      77.8%            88.9%                14.2%
crude      oil+       bpd+          OPEC+       73.2%            86.2%                17.4%
trade      trade+     vs-           cts-        67.1%            76.5%                14.0%
interest   rates+     rate+         vs-         57.0%            76.2%                33.6%
ship       ships+     vs-           strike+     64.1%            75.4%                17.6%
wheat      wheat+     tonnes+       WHEAT+      87.8%            82.6%                -5.9%
corn       corn+      tonnes+       vs-         70.3%            83.7%                19.0%
Average                                         79.9%            92.0%                15.1%

Table 7: Reuters: Three best words (in terms of Mutual Information) and their categorization BEP
rate of the 10 largest categories. "+" near a word means that the appearance of the word
predicts the corresponding category, "-" means that the absence of the word predicts the
category. Words in upper-case are words that appeared in article titles (see Section 4.1).
Category   1st word    2nd word   3rd word   Accuracy on 3 words   Accuracy on 15000 words   Relative Improvement
course     courses     course     homework   79.0%                 95.7%                     21.1%
faculty    professor   cite       pp         70.5%                 89.8%                     27.3%
project    projects    umd        berkeley   53.2%                 80.8%                     51.8%
student    corn        uci        homes      78.3%                 95.9%                     22.4%
Average                                      73.3%                 92.4%                     26.0%

Table 8: WebKB: Three best words (in terms of Mutual Information) and their categorization accuracy
rate of the 4 representative categories. All the listed words contribute by their
appearance, rather than absence.



Category                   1st word   2nd word     3rd word   Accuracy on 3 words   Accuracy on 15000 words   Relative Improvement
alt.atheism                atheism    atheists     morality   48.7%                 84.8%                     74.1%
comp.graphics              image      jpeg         graphics   40.5%                 83.1%                     105.1%
comp.os.ms-windows.misc    windows    m            0          60.9%                 84.7%                     39.0%
comp.sys.ibm.pc.hardware   scsi       drive        ide        13.8%                 76.6%                     455.0%
comp.sys.mac.hardware      mac        apple        centris    61.0%                 86.7%                     42.1%
comp.windows.x             window     server       motif      46.6%                 86.7%                     86.0%
misc.forsale               00         sale         shipping   63.4%                 87.3%                     37.6%
rec.autos                  car        cars         engine     62.6%                 89.6%                     44.5%
rec.motorcycles            bike       dod          ride       77.3%                 94.0%                     21.6%
rec.sport.baseball         baseball   game         year       38.2%                 95.0%                     148.6%
rec.sport.hockey           hockey     game         team       67.7%                 97.2%                     43.5%
sci.crypt                  key        encryption   clipper    76.7%                 95.4%                     24.3%
sci.electronics            circuit    wire         wiring     15.2%                 85.3%                     461.1%
sci.med                    cancer     medical      msg        26.0%                 92.4%                     255.3%
sci.space                  space      nasa         orbit      62.5%                 94.5%                     51.2%
soc.religion.christian     god        church       sin        50.2%                 91.7%                     82.6%
talk.politics.guns         gun        guns         firearms   41.5%                 87.5%                     110.8%
talk.politics.mideast      israel     armenian     turkish    54.8%                 94.1%                     71.7%
talk.politics.misc         cramer     president    ortilink   23.0%                 67.7%                     194.3%
talk.religion.misc         jesus      god          jehovah    6.6%                  53.8%                     715.1%
Average                                                       46.83%                86.40%                    153.23%

Table 9: 20NG: Three best words (in terms of Mutual Information) and their categorization accuracy
rate (uni-labeled setting). All the listed words contribute by their appearance, rather
than absence.

In Figure 1 we present, for each dataset, a learning curve plotting the obtained performance of 
the BOW+MI categorizer as a function of the number k of selected words. 22 As can be seen, the two 

22. In the case of Reuters and 20NG the performance is measured in terms of BEP and in the case of WebKB in terms of
accuracy.
curves of both Reuters and WebKB are very similar and almost reach a plateau with k = 50 words
(that were chosen using the greedy Mutual Information index). This indicates that other words
do not contribute much to categorization. But the learning curve of 20NG continues to rise when
0 < k < 300, and still exhibits a rising slope with k = 300 words.

[...] 20NG dataset on the one hand, and of the Reuters and WebKB datasets, on the other hand. We identify
another interesting difference between the corpora. This difference is related to the hyper-parameter
$W_{low\_freq}$ (see Section 4). The bottom line is that in the case of 20NG IB categorization improves
when $W_{low\_freq}$ decreases while in the case of Reuters and WebKB it improves when $W_{low\_freq}$
increases. In other words, more words and even the most infrequent words can be useful and
improve the (IB) categorization of 20NG. On the other hand, such rare words do add noise in the
(IB) categorization [...]
the three corpora as a function of $W_{low\_freq}$. Note again that this opposite sensitivity to rare words is
observed with respect to the IB scheme and the previous discussion concerns the BOW+MI scheme.
Figure 2: Performance of the IB categorizer as a function of the $W_{low\_freq}$ parameter (that specifies
the threshold of the low frequency word filter: words appearing in less than $W_{low\_freq}$
articles are removed); uni-labeled categorization of WebKB and 20NG (accuracy), multi-labeled
categorization of Reuters (BEP). Note that $W_{low\_freq} = 0$ corresponds to the case
where this filter is disabled. The number of word clusters in all cases is k = 300.
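The low-frequency word filter can be sketched as a document-frequency threshold. This is a minimal illustration with a hypothetical function name. It assumes the "appear in $W_{low\_freq}$ articles or less" reading from Section 4.1; the Figure 2 caption says "less than", and the two readings differ only at the boundary. With either reading a threshold of 0 removes nothing.

```python
from collections import Counter


def filter_low_frequency(docs, w_low_freq):
    """Remove words that occur in w_low_freq documents or fewer.

    docs: list of token lists. A word's document frequency counts each
    document at most once, however often the word repeats inside it.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    return [[t for t in tokens if doc_freq[t] > w_low_freq] for tokens in docs]


if __name__ == "__main__":
    corpus = [["a", "b"], ["a", "c"], ["a", "b"]]
    for threshold in (0, 1, 2):
        print(threshold, filter_low_frequency(corpus, threshold))
```

In the toy corpus the threshold 1 drops "c" (one document), and the threshold 2 also drops "b" (two documents).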



8. Computational Efforts

We performed all our experiments using a 600MHz, 2G RAM, dual processor Pentium III PC operated
by Windows 2000. The IB clustering software, preprocessed datasets and application scripts
can be found at:



http://www.cs.technion.ac.il/~ronb 
The computational bottlenecks were mainly experienced over 20NG, which is substantially larger
than Reuters and WebKB.

When running the BOW+MI
categorizer, the computational bottleneck was the SVM training, for which a single run (one of the
4 cross-validation folds, including both training and testing) could take a few hours, depending on
the parameter values. In general, the smaller the parameters C and J are, the faster the SVM training
is. 23

As for the IB categorizer, the SVM training process was faster when the input vectors consisted 
of word clusters. However, the clustering itself could take up to one hour for each fold of the entire 
20NG set, and required substantial amount of memory (up to 1G RAM). The overall training and 
testing time over the entire 20NG in the multi-labeled setting was about 16 hours (4 hours for each 
of the 4 folds).

The computational bottleneck when running uni-labeled experiments was the SVM parameter
tuning. It required a repetition [...]
(see Section 5.2). Overall the experiments with the IB categorizer took about 45 hours of CPU time,
while the BOW+MI categorizer required about 96 hours (i.e. 4 days).

The experiments with the relatively small WebKB corpus were accordingly less time-consuming.
In particular, the experiments with the BOW+MI categorizer required 7 hours of CPU time and those
with the IB categorizer, about 8 hours. Thus, when comparing these times with the experiments on
20NG we see that the IB categorizer is less time-consuming than the BOW+MI categorizer (based
on 15000 words) but the clustering algorithm requires larger memory. On Reuters the experiments
ran even faster, because there was no need to apply cross-validation estimation.

9. Concluding Remarks

In this study we have provided further evidence for the effectiveness of a sophisticated technique for
document representation using distributional clustering of words. Previous studies of distributional
clustering of words remained somewhat inconclusive because the overall absolute categorization
performance was [...] (to the
best of our knowledge, in all previous studies of distributional clustering as a representation method
for supervised text categorization, the classifier used was Naive Bayes).

We show that when Information Bottleneck distributional clustering is combined with an SVM
classifier [...]
benchmark datasets. In particular, on the 20NG dataset, with respect to either multi-labeled or uni-labeled
categorization, we obtain either accuracy (BEP) or representation efficiency advantages over
BOW when the categorization is based on SVM. This result indicates that sophisticated document
representations can significantly outperform the standard BOW representation and achieve state-of-the-art
performance.

Nevertheless, [...]
generation technique when categorizing the Reuters or WebKB corpora. Our study of the three corpora
shows structu[...]
can be categorized with close to "optimal" performance using a small set of words, where the addition
of many thousands more [...]
23. SVMlight and its parameters are described by Joachims (1998a).
that the "complexity" of the 20NG corpus is in some sense higher than that of Reuters and WebKB.
In addition, [...]
words when it is applied with the 20NG corpus. On the other hand, such infrequent words do not affect
or even degrade the performance of the IB categorizer when applied to the Reuters and WebKB
corpora.

Based on our experience with the above corpora we note that when testing complex feature
selection [...]
conclusions based only on "low-complexity" corpora such as Reuters and WebKB. It seems that sophisticated
representation methods cannot outperform BOW on such corpora.

Let us conclude with some questions and directions for future research. Given a pool of two or
more representation techniques and given a corpus, an interesting question is whether it is possible
to combine them in a way that will be competitive with or even outperform the best technique in the
pool. A straightforward approach would be to perform cross-validated model selection. However,
this approach will be at best as good as the best technique in the pool. Another possibility is to try to
combine the representation techniques by devising a specialized categorizer for each representation
and then use ensemble techniques to aggregate decisions. Other sophisticated approaches such as
"co-training" (see, e.g., Blum and Mitchell, 1998) can also be considered.

Our application of the IB distributional clustering of words employed document class labels
but generated a global clustering for all categories. Another possibility to consider is to generate
specialized clustering [...]
clustering of n-grams, with $1 \le n \le N$ for some small $N$.

Another interesting question that we did not explore concerns the behavior of IB and BOW
representations when using feature sets of small cardinality (e.g. k = 10). It is expected that at least
in "complex" datasets like 20NG, there should be an advantage to the IB representation also in this
case.

The BOW+MI categorization employed Mutual Information feature selection, where the number
k of features (words) was identical for all categories. It would be interesting to consider a
specialized k for each category. Although it might be hard to identify a good set of vocabularies,
this approach may lead to somewhat better categorization and is likely to generate more efficient
representations.

In all our experiments we used the simple-minded one-against-all decomposition technique. It
would be interesting to study other decompositions (perhaps, using error correcting output coding
approaches). The inter-relation between feature selection/generation and the particular decomposition
is of particular importance and may improve text categorization performance.

We computed our word clustering using the original top-down (soft) clustering IB implementation
of Tishby et al. (1999). It would be interesting to explore the power of more recent IB
implementations in this context. Specifically, the IB clustering methods described by El-Yaniv and
Souroujon (2001) and Slonim et al. (2002) may yield better clustering in the sense that they tend to
better approximate the optimal IB objective.